Abstract

In an earlier work, the statistical physics associated with finite-temperature decoding of
code ensembles, along with its relation to their random coding error exponents, was explored
in a framework that is analogous to Derrida's random energy model (REM) of spin glasses,
according to which the energy levels of the various spin configurations are independent random
variables. The generalized REM (GREM) extends the REM in that it introduces correlations
between energy levels in a hierarchical structure. In this paper, we explore some analogies
between the behavior of the GREM and that of code ensembles which have parallel hierarchical
structures. In particular, in analogy to the fact that the GREM may have different types of
phase transition effects, depending on the parameters of the model, the above-mentioned
hierarchical code ensembles behave substantially differently in the various domains of the design
parameters of these codes. We attempt to explore the insights that can be imported
from the statistical mechanics of the GREM and harnessed to serve as code design
considerations and guidelines.



Index Terms: Spin glasses, GREM, phase transitions, random coding, error exponents. 



1 Introduction 

In the last few decades it has become apparent that many problems in Information Theory have 
analogies to certain problems in the area of statistical physics of disordered systems. Such analogies 
are useful because physical insights, as well as statistical mechanical tools and analysis techniques 
can be harnessed in order to advance the knowledge and the understanding with regard to the 
information-theoretic problem under discussion.

One important example of such an analogy is between the statistical physics of disordered
magnetic materials, a.k.a. spin glasses, and the behavior of certain ensembles of random codes for
source coding (see, e.g., [1], [2], [3], [4]) and for channel coding (see, e.g., [5] and references therein).

Among the various models of interaction disorder in spin glasses, one of the most fascinating
models is the random energy model (REM), invented by Derrida in the early eighties [21], [22], [23]
(see also, e.g., [20], [24], [25], for later developments). The REM is, on the one hand, extremely
simple and easy to analyze, and on the other hand, rich enough to exhibit phase transitions.
According to the REM, the different spin configurations are distributed according to the Boltzmann
distribution, namely, their probabilities are proportional to an exponential function of their negative
energies, but the configuration energies themselves are i.i.d. random variables, hence the name
random energy model.

In [5, Chap. 6], Mézard and Montanari draw an interesting analogy between the REM and the
statistical physics pertaining to finite temperature decoding [18] of ensembles of random block codes.
The relevance of the REM here is due to the fact that in this context, the partition function that
naturally arises has the log-likelihood function (of the channel output given the input codeword)
as its energy function (Hamiltonian), and since the codewords are selected at random, the
induced energy levels are random variables. Consequently, the phase transitions of the REM are
'inherited' by ensembles of random block codes, as is shown in [5]. In [26], this subject was further 
studied and the free energies corresponding to the various phases were related to random coding 
exponents of the probability of error at rates below capacity and to the probability of correct 
decoding at rates above capacity. 

While the REM is a very simple and interesting model for capturing disorder, as described 
above, it is not quite faithful for the description of a real physical system. The reason is that 
according to the REM, any two distinct spin configurations, no matter how similar and close to 
each other, have independent, and hence unrelated, energies. A more realistic model must take 
into account the geometry and the structure of the physical system and thus allow dependencies 
between energies associated with closely related configurations. 



^1 More details on this and other terminology described in the remaining part of this Introduction will be
given in Section 3.

This observation has motivated Derrida to develop the generalized random energy model (GREM)
[27] (see also, e.g., [28], [29], [30], [31], [32], [33], for later related work). The GREM extends the
REM in that it introduces a hierarchical structure in the form of a tree, by grouping subsets of
(neighboring) spin configurations in several levels, where the leaves of this tree correspond to the 
various configurations. According to the GREM, for every branch in this tree, there is an associated 
independent randomly chosen energy component. The total energy of each configuration is then 
the sum of these energy components along the branches that form the path from the root of the 
tree to the leaf corresponding to this configuration. This way, the degree of dependency between 
the energies of two different configurations depends on the 'distance' between them on the tree: 
More precisely, it depends on the number of common branches shared by their paths from the root 
up to the node at which their paths split. The GREM is somewhat more complicated to analyze 
than the REM, but not substantially so. It turns out that the number of phase transitions in the 
GREM depends on the parameters of the model. If the tree has k levels, there can be up to k 
phase transitions, but there can also be a smaller number. For example, in the case k = 2, under 
a certain condition, there is only one phase transition and the behavior of the free energy in both 
phases is just like in the ordinary REM. 

In analogy to the above described relationship between the REM and the statistical physics 
of random block codes, the natural question that now arises is whether the GREM and its phase 
transitions can give us some insights about the behavior of code ensembles with some hierarchical 
structure (e.g., tree-structured codes, successive refinement codes, etc.). In particular, in what way 
do these phase transitions guide us in the choice of the design parameters of these codes? It is the 
purpose of this paper to explore these questions and to give at least some partial answers. 

We demonstrate that there is indeed an intimate relationship between the GREM and certain
ensembles of hierarchical codes. Consider, for example, a two-stage rate-distortion code of block
length $n = n_1 + n_2$, where the first $n_1$ components of the reproduction vector, at rate $R_1$, depend
only on the first $n_1 R_1$ bits of the compressed bitstream, and the last $n_2$ symbols of the reproduction
codeword, at rate $R_2$, depend on the entire bitstream of length $n_1 R_1 + n_2 R_2$. The overall rate of
this code is, of course, the weighted average of $R_1$ and $R_2$ with weights proportional to $n_1$ and $n_2$,
respectively. An ensemble of codes with this structure is defined as follows: First, we randomly
draw a rate $R_1$ codebook of block length $n_1$ according to some distribution. Then, for each resulting
codeword of length $n_1$, we randomly draw a rate $R_2$ codebook of block length $n_2$.^2 Thus, the code
has a tree structure with two levels, like a two-level GREM. The overall distortion of the code
along the entire n symbols is the sum of partial distortions along the two segments, in analogy to
the above described additivity of the partial energies along the branches of the tree pertaining to
the GREM, and since the codewords are random, so are the distortions they induce.
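
The two-level random construction described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the reproduction alphabet is taken to be binary, the distortion is Hamming, symbols are drawn by fair coin tossing, and the hypothetical parameters m1 and m2 stand in for the codebook sizes determined by the pairs $(n_1, R_1)$ and $(n_2, R_2)$.

```python
import numpy as np

def draw_two_stage_codebook(n1, n2, m1, m2, rng):
    """Draw a two-level tree-structured binary codebook.

    m1 first-stage codewords of length n1 are drawn; then, for EACH of them,
    an independent set of m2 second-stage codewords of length n2 is drawn.
    """
    first = rng.integers(0, 2, size=(m1, n1))        # first-level branches
    second = rng.integers(0, 2, size=(m1, m2, n2))   # one sub-codebook per branch
    return first, second

def total_distortions(first, second, x):
    """Hamming distortion of every leaf codeword relative to the source vector x."""
    n1 = first.shape[1]
    d1 = np.sum(first != x[:n1], axis=1)             # partial distortion, shape (m1,)
    d2 = np.sum(second != x[n1:], axis=2)            # partial distortion, shape (m1, m2)
    return d1[:, None] + d2                          # additive along the tree path

rng = np.random.default_rng(0)
n1, n2, m1, m2 = 6, 6, 4, 8                          # toy sizes, chosen arbitrarily
first, second = draw_two_stage_codebook(n1, n2, m1, m2, rng)
x = rng.integers(0, 2, size=n1 + n2)                 # a source vector
dist = total_distortions(first, second, x)
```

Each entry dist[i, j] is the distortion of the leaf reached through first-level branch i and second-level branch j, so leaves under the same branch i share the partial distortion of the first segment, just as sibling configurations of the GREM share a branch energy.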

The motivation for this class of codes, especially when the idea is generalized from two parts
to a larger number of k parts, say, of equal length ($n_1 = n_2 = \cdots = n_k = n/k$), is that the delay,
at least at the decoder, is reduced from n to n/k, because the decoder is causal at the level of
segments of length n/k. The following questions now arise: Is there any inherent penalty, in terms
of performance, for this ensemble of reduced delay decoding codes? If so, how can we minimize
this penalty? If not, how should we choose the design parameters (i.e., $n_i$ and $R_i$, i = 1, ..., k, for
a given overall average rate R) such that this code will 'behave' like a full block code of length n?

For simplicity, let us return to the case k = 2. For a given R and n, we have two degrees of
freedom: the choices of $R_1$ and $n_1$ (which will then dictate $R_2$ and $n_2$). Is it better to choose
$R_1 > R_2$ or $R_1 < R_2$, if it makes any difference at all? A similar question can be asked concerning
$n_1$ and $n_2$. The answer depends, of course, on our figure of merit. Obviously, if one is interested
only in the asymptotic distortion, the question becomes uninteresting, because then by choosing
two independent codes^3 for the two parts, both at rate R, the overall distortion will be given by
the distortion-rate function, $D(R)$, just like that of the full unstructured code. For a given n,
of course, the redundancies will correspond to the shorter blocks $n_1$ and $n_2$, but this is a second
order effect. Here, we choose to examine performance in terms of the characteristic function of the
overall distortion, $E[\exp\{-s \cdot \mathrm{distortion}\}]$. This is, of course, a much more informative figure of
merit than the average distortion, because in principle, it gives information on the entire probability
distribution of the distortion. In particular, it generates all the moments of the distortion by taking
derivatives, and it is useful in deriving Chernoff bounds on probabilities of large deviations events
concerning the distortion. In the context of the analogy with statistical physics and the GREM,
this characteristic function can easily be related to the partition function whose Hamiltonian is
given by the distortion.

^2 Note that this is different from using the same second-stage codebook for all first-part codewords, in which
case this is just a combination of two codebooks of lengths $n_1$ and $n_2$, operating independently.
^3 Cf. footnote no. 2.

It turns out that the characteristic function of the distortion behaves in a surprisingly
interesting manner and with a direct relation to the GREM. For $R_1 < R_2$, when the corresponding
GREM has k = 2 phase transitions, the characteristic function of the distortion behaves like that
of two independent block codes of lengths $n_1$ and $n_2$ and rates $R_1$ and $R_2$, thus the dependency
between the two parts of the code is not exploited in terms of performance. For $R_1 > R_2$, which
is the case where the analogous GREM has only one phase transition (and behaves exactly like
the ordinary REM, which is parallel to an ordinary random block code with no structure), the
characteristic function behaves like that of a full unstructured optimum block code at rate R across
a certain interval of small s, but beyond a certain point, it becomes inferior to that of a full code.
For $R_1 = R_2 = R$, it behaves like the unstructured code for the entire range of s > 0, but then one
might as well use two independent block codes (and reduce the search complexity at the encoder
from $e^{nR}$ to $2e^{nR/2}$). The choices of $n_1$ and $n_2$ are immaterial in that sense, as long as they both
grow linearly with n. Thus, the conclusion is that it is best to use $R_1 = R_2$, but if communication
[...] then performance is better when
$R_1 > R_2$ than when $R_1 < R_2$. These results can be extended to the case of k stages.

A parallel analysis can be applied to analogous ensembles of (reduced delay) channel encoders
of block length $n = n_1 + n_2$ (for the case k = 2), which have a similar tree structure: Here, the first
$n_1$ channel letters of each block depend only on the first $n_1 R_1$ information bits, whereas the other
$n_2$ channel symbols depend on the entire information vector of length $n_1 R_1 + n_2 R_2$. The random
codebook is again drawn hierarchically in the same manner as before. If the code performance is
judged in terms of the error exponent, then once again, the choice $R_1 > R_2$ is always better than
the choice $R_1 < R_2$. Here, unlike the source coding problem, there is an additional consideration:
There are two types of incorrect codewords that are competing with the correct one in the decoding
process: those for which the first $n_1$ channel inputs agree with those of the correct codeword (the
first segment is the same) and those for which this is not the case. In this case, $R_2$ has to be chosen
sufficiently small so that the error term contributed by erroneous codewords of the first kind would
not dominate the probability of error. Considering the case $n_1 = n_2 = n/2$, if the overall average
rate is not too small, it is possible to choose $R_1$ and $R_2$ so that the error exponent of this ensemble
of codes is not worse than that of an ordinary random code with no structure. This idea can be
extended to k stages in a straightforward manner. In fact, we propose a systematic procedure to
allocate rates to the different stages in a way that guarantees that the error exponent would be
at least as good as the classical random coding error exponent pertaining to an ordinary
random code at rate R.

^4 For example, this can be the case if there are additional users in the system and the bandwidth allocation for
each user changes in a dynamical manner, or if different parts of the encoded information are transmitted via separate
links with different capacities.

The outline of this paper is as follows. In Section 2, a few notation conventions are described. 
In Section 3, we provide some more detailed background in statistical physics, with emphasis on 
the REM and the GREM. Finally, in Section 4, we present our main results on hierarchical code 
ensembles of the type described above, along with their relationship to the GREM. Readers who are 
not interested in the relationship with statistical physics (although this is one of the main points in 
the paper) may skip Section 3 and ignore, in Section 4, the comments on the statistical mechanical 
aspects, all this without essential loss of continuity. 

2 Notation Conventions 

Throughout this paper, scalar random variables (RV's) will be denoted by capital letters, like X 
and Y, their sample values will be denoted by the respective lower case letters, and their alphabets 
will be denoted by the respective calligraphic letters. A similar convention will apply to random 
vectors and their sample values, which will be denoted with the same symbols in the boldface font. 
Thus, for example, $\boldsymbol{X}$ will denote a random n-vector $(X_1, \ldots, X_n)$, and $\boldsymbol{x} = (x_1, \ldots, x_n)$ is a specific
vector value in $\mathcal{X}^n$, the n-th Cartesian power of $\mathcal{X}$.

Sources and channels will be denoted generically by the letters P and Q. Specific letter
probabilities corresponding to a source Q will be denoted by the corresponding lower case letters, e.g.,
$q(x)$ is the probability of a letter $x \in \mathcal{X}$. A similar convention will be applied to the channel P and
the corresponding transition probabilities, $p(y|x)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$. The expectation operator will be
denoted by $E\{\cdot\}$.

The cardinality of a finite set $\mathcal{A}$ will be denoted by $|\mathcal{A}|$. For two positive sequences $\{a_n\}$ and
$\{b_n\}$, the notation $a_n \doteq b_n$ means that $a_n$ and $b_n$ are asymptotically of the same exponential
order, that is, $\lim_{n\to\infty} \frac{1}{n}\ln\frac{a_n}{b_n} = 0$. Similarly, $a_n \stackrel{\cdot}{\le} b_n$ means that $\limsup_{n\to\infty} \frac{1}{n}\ln\frac{a_n}{b_n} \le 0$, etc.
Information theoretic quantities like entropies and mutual informations will be denoted following
the usual conventions of the Information Theory literature.



3 Background 

In this section, we provide some basic background in statistical physics, focusing primarily on the 
REM, along with its relevance to ordinary ensembles of source and channel block codes, and then 
we extend the scope to the GREM. 

3.1 General 

Consider a physical system with a large number n of particles, which can be in a variety of
'microstates' pertaining to the various combinations of the microscopic physical states (characterized
by position, momentum, spin, etc.) that these particles may have. For each such microstate of the
system, which we shall designate by a vector $\boldsymbol{x}$, there is an associated energy, given by an energy
function (Hamiltonian) $\mathcal{E}(\boldsymbol{x})$. One of the most fundamental results in statistical physics (based on
the law of energy conservation and the basic postulate that all microstates of the same energy level
are equiprobable) is that when the system is in equilibrium, the probability of a microstate $\boldsymbol{x}$ is
given by the Boltzmann distribution

$$P(\boldsymbol{x}) = \frac{e^{-\beta\mathcal{E}(\boldsymbol{x})}}{Z(\beta)} \tag{1}$$

where $\beta$ is the inverse temperature, that is, $\beta = 1/T$, T being temperature,^5 and $Z(\beta)$ is the
normalization constant, called the partition function, which is given by

$$Z(\beta) = \sum_{\boldsymbol{x}} e^{-\beta\mathcal{E}(\boldsymbol{x})} \quad\text{or}\quad Z(\beta) = \int d\boldsymbol{x}\, e^{-\beta\mathcal{E}(\boldsymbol{x})},$$

depending on whether $\boldsymbol{x}$ is discrete or continuous. The role of the partition function is by far
deeper than just being a normalization factor, as it is actually the key quantity from which many
macroscopic physical quantities can be derived, for example, the free energy is $F = -\frac{1}{\beta}\ln Z(\beta)$,
the average internal energy (i.e., the expectation of $\mathcal{E}(\boldsymbol{x})$ where $\boldsymbol{x}$ is drawn according to (1)) is given
by the negative derivative of $\ln Z(\beta)$, the heat capacity is obtained from the second derivative, etc.

^5 More precisely, $\beta = 1/(kT)$, where k is Boltzmann's constant, but following the common abuse of the notation,
we redefine $T \leftarrow kT$ as temperature (in units of energy).
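
As a small numerical illustration of these relations, the following Python sketch evaluates the Boltzmann distribution, the partition function and the free energy for a finite set of energy levels, and compares the average energy with a finite-difference estimate of $-\partial \ln Z(\beta)/\partial\beta$. The energy values, the value of $\beta$ and the step size h are arbitrary choices made here for illustration only.

```python
import numpy as np

def log_partition(beta, energies):
    """ln Z(beta) for a finite list of microstate energies (log-sum-exp for stability)."""
    a = -beta * np.asarray(energies, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def boltzmann(beta, energies):
    """Boltzmann probabilities exp(-beta * E) / Z(beta) of the microstates."""
    a = -beta * np.asarray(energies, dtype=float)
    return np.exp(a - log_partition(beta, energies))

energies = np.array([0.0, 1.0, 1.0, 2.5])   # hypothetical toy energy levels
beta = 0.7                                  # arbitrary inverse temperature

p = boltzmann(beta, energies)
avg_energy = float(p @ energies)                     # expectation of the energy
free_energy = -log_partition(beta, energies) / beta  # F = -(1/beta) ln Z(beta)

# Central finite difference approximating -d ln Z / d beta.
h = 1e-6
neg_dlnZ = -(log_partition(beta + h, energies)
             - log_partition(beta - h, energies)) / (2 * h)
```

Up to the finite-difference error, avg_energy and neg_dlnZ agree, which is the statement that the internal energy is the negative derivative of $\ln Z(\beta)$.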

One of the important examples of such a multi-particle physical system is that of a magnetic 
material, in which each molecule has a magnetic moment, a three-dimensional vector which tends 
to align with the magnetic field felt by that molecule. In addition to the influence of a possible 
external magnetic field, there is also an effect of mutual interactions between the magnetic moments 
of various (neighboring) molecules. Quantum mechanical considerations dictate that the set of 
possible configurations of each magnetic moment (spin) is discrete: in the simplest case, it has 
only two possible values, which we shall designate by +1 (spin up) and -1 (spin down). Thus,
a spin configuration, i.e., the vector of spins of n molecules, is designated by a binary vector
$\boldsymbol{x} = (x_1, \ldots, x_n)$, where each component $x_i$ takes values in $\{-1, +1\}$ according to the spin of the
i-th molecule, i = 1, 2, ..., n. When the spins of a certain magnetic material tend to align in the
same direction, the material is called ferromagnetic, and a customary model of the Hamiltonian,
the Ising model, is given by

$$\mathcal{E}(\boldsymbol{x}) = -J\sum_{\langle i,j\rangle} x_i x_j - B\sum_{i=1}^{n} x_i \tag{2}$$

where in the first term, pertaining to the interaction, J > 0 describes the intensity of the interaction
with the summation being defined over pairs of neighboring spins (depending on the geometry of
the problem), and the second term is associated with an external magnetic field (proportional to) B.
When J < 0, the material is antiferromagnetic, namely, neighboring spins 'prefer' to be antiparallel.
More general models allow interactions not only with immediate neighbors, but also more distant
ones, and then there are different strengths of interaction, depending on the distance between the
two spins. In this case, the first term is replaced by the more general form $-\sum_{i,j} J_{ij} x_i x_j$, where
now the sum can be defined over all possible pairs $\{(i,j)\}$.^6 Here, in addition to the ferromagnetic
case, where all $J_{ij} > 0$, and the antiferromagnetic case, where all $J_{ij} < 0$, there is also a situation
where some $J_{ij}$ are positive and others are negative, which is the case of a spin glass. Here, not all
spin pairs can be in their preferred mutual position (parallel/antiparallel), thus the system may be
frustrated.

^6 Moreover, the interaction term may be generalized to include also summations over triples of spins, quadruples,
etc., but we will limit the discussion to pairs.

To model situations of disorder, it is common to model $J_{ij}$ as random variables (RV's) with,
say, equal probabilities of being positive or negative. For example, in the Edwards-Anderson (EA)
model [34], $J_{ij}$ are taken to be i.i.d. zero-mean Gaussian RV's when i and j are neighbors and zero
otherwise. In the Sherrington-Kirkpatrick (SK) model [35], all $\{J_{ij}\}$ are i.i.d. zero-mean Gaussian
RV's. Thus, the system has two levels of randomness: the randomness of the interaction coefficients 
and the randomness of the spin configuration given the interaction coefficients, according to the 
Boltzmann distribution. However, the two sets of RV's are normally treated differently. The 
random coefficients are considered quenched RV's in the terminology of physicists, namely, they 
are considered fixed in the time scale at which the spin configuration may vary. This is analogous 
to the situation of coded communication in a random coding paradigm: A randomly drawn code 
should normally be thought of as a quenched entity, as opposed to the randomness of the source 
and/or the channel. 

3.2 The REM 

In [21], [22], [23], Derrida took the above described idea of randomizing the (parameters of the)
Hamiltonian to an extreme, and suggested a model of spin glass with disorder under which the
energy levels $\{\mathcal{E}(\boldsymbol{x})\}$ are simply i.i.d. RV's, without any structure in the form of (2) or its
above-described extensions. In particular, in the absence of a magnetic field, the $2^n$ RV's $\{\mathcal{E}(\boldsymbol{x})\}$ are taken
to be zero-mean Gaussian RV's, all with variance $nJ^2/2$, where J is a parameter.^7 The beauty of
the REM is that on the one hand, it is very easy to analyze, and on the other hand, it is
sufficiently rich to exhibit phase transitions.

The basic observation about the REM is that for a typical realization of the configurational
energies, the number of configurations with energy about E (i.e., between E and E + dE),
N(E), is proportional (up to sub-exponential terms in n) to $2^n \cdot e^{-E^2/(nJ^2)}$, as long as
$|E| < E_0 = nJ\sqrt{\ln 2}$, whereas energy levels outside this range are typically not populated by
spin configurations (N(E) = 0), as the probability of having at least one configuration with such
an energy decays exponentially with n. Thus, the asymptotic (thermodynamical) entropy per spin,
which is defined by

$$S(E) = \lim_{n\to\infty} \frac{\ln N(E)}{n},$$

is given by

$$S(E) = \begin{cases} \ln 2 - \left(\frac{E}{nJ}\right)^2 & |E| < E_0 \\ 0 & |E| = E_0 \\ -\infty & |E| > E_0 \end{cases}$$

^7 The variance scales linearly with n to match the behavior of the Hamiltonian (2) with a limited number of
interacting neighbors and random interaction parameters, which has a number of independent terms that is linear in
n.

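The partition function of a typical realization of a REM spin glass is then

$$Z(\beta) \doteq \int_{-E_0}^{E_0} dE \cdot N(E) \cdot e^{-\beta E} \doteq \int_{-E_0}^{E_0} dE \cdot e^{nS(E)} \cdot e^{-\beta E} \tag{3}$$

whose exponential growth rate,

$$\phi(\beta) = \lim_{n\to\infty} \frac{\ln Z(\beta)}{n},$$

behaves according to

$$\phi(\beta) = \max_{|E| \le E_0}\left[S(E) - \frac{\beta E}{n}\right] = \max_{|E| \le E_0}\left[\ln 2 - \left(\frac{E}{nJ}\right)^2 - \beta J \cdot \frac{E}{nJ}\right]. \tag{4}$$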

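Solving this simple optimization problem, we find that $\phi(\beta)$ is given by

$$\phi(\beta) = \begin{cases} \ln 2 + \frac{\beta^2 J^2}{4} & \beta \le \frac{2}{J}\sqrt{\ln 2} \\ \beta J\sqrt{\ln 2} & \beta > \frac{2}{J}\sqrt{\ln 2} \end{cases}$$

which means that the asymptotic free energy per spin, a.k.a. the free energy density, which is
obtained by

$$F(\beta) = -\frac{\phi(\beta)}{\beta},$$

is given by (cf. [5, Proposition 5.2]):

$$F(\beta) = \begin{cases} -\frac{\ln 2}{\beta} - \frac{\beta J^2}{4} & \beta \le \frac{2}{J}\sqrt{\ln 2} \\ -J\sqrt{\ln 2} & \beta > \frac{2}{J}\sqrt{\ln 2} \end{cases}$$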

Thus, the free energy density is subjected to a phase transition at the inverse temperature
$\beta_0 = \frac{2}{J}\sqrt{\ln 2}$. At high temperatures ($\beta < \beta_0$), which is referred to as the paramagnetic phase, the partition
function is dominated by an exponential number of configurations with energy $E = -n\beta J^2/2$ and
the entropy grows linearly with n. When the system is cooled to $\beta = \beta_0$ and beyond, which is
the glassy phase, the system freezes but it is still in disorder: the partition function is dominated
by a subexponential number of configurations of minimum energy $E = -E_0$. The entropy, in this
case, grows sublinearly with n, namely the entropy per spin vanishes, and the free energy density
no longer depends on $\beta$. Further details about the REM can be found in [5] and the references
mentioned in the Introduction.
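
The two-phase behavior can be checked numerically. The sketch below, in Python, compares the closed-form exponent $\phi(\beta)$ (quadratic in $\beta$ below $\beta_0 = \frac{2}{J}\sqrt{\ln 2}$, linear above it) with a brute-force maximization of $\ln 2 - \epsilon^2 - \beta J\epsilon$ over a grid of normalized energies $\epsilon = E/(nJ)$ in $[-\sqrt{\ln 2}, \sqrt{\ln 2}]$. The function names, the grid size and the default J = 1 are choices made here for illustration.

```python
import numpy as np

def phi_closed_form(beta, J=1.0):
    """Exponential growth rate of the REM partition function (two phases)."""
    beta0 = 2.0 * np.sqrt(np.log(2.0)) / J
    if beta <= beta0:                                  # paramagnetic phase
        return np.log(2.0) + (beta * J) ** 2 / 4.0
    return beta * J * np.sqrt(np.log(2.0))             # glassy phase

def phi_numeric(beta, J=1.0, grid=200001):
    """Brute-force maximum of ln 2 - eps^2 - beta*J*eps over populated energies."""
    eps0 = np.sqrt(np.log(2.0))
    eps = np.linspace(-eps0, eps0, grid)               # eps = E / (n J)
    return float(np.max(np.log(2.0) - eps ** 2 - beta * J * eps))

def free_energy_density(beta, J=1.0):
    """F(beta) = -phi(beta) / beta."""
    return -phi_closed_form(beta, J) / beta
```

For $\beta > \beta_0$ the maximizer sticks to the boundary $\epsilon = -\sqrt{\ln 2}$, which is why free_energy_density returns the same value, $-J\sqrt{\ln 2}$, for every $\beta$ in the glassy phase.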



3.3 The REM and Random Code Ensembles 

As described in [5], there is an interesting analogy between the REM and the partition function
pertaining to finite temperature decoding [18] of ensembles of channel block codes (see also [26]).

In particular, consider a codebook $\mathcal{C}$ of $M = e^{nR}$ binary codewords of length n, $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_M$,
to be used across a binary symmetric channel (BSC) with crossover probability p. Given a binary
vector $\boldsymbol{y}$ at the channel output, consider the generalized posterior parametrized by $\beta$:

$$P_\beta(\boldsymbol{x}|\boldsymbol{y}) = \frac{P^\beta(\boldsymbol{y}|\boldsymbol{x})}{\sum_{\boldsymbol{x}' \in \mathcal{C}} P^\beta(\boldsymbol{y}|\boldsymbol{x}')} = \frac{e^{-\beta B d_H(\boldsymbol{x},\boldsymbol{y})}}{Z(\beta|\boldsymbol{y})} \tag{5}$$

where $B = \ln\frac{1-p}{p}$, $d_H(\boldsymbol{x},\boldsymbol{y})$ is the Hamming distance between $\boldsymbol{x}$ and $\boldsymbol{y}$,
$Z(\beta|\boldsymbol{y}) = \sum_{\boldsymbol{x}' \in \mathcal{C}} e^{-\beta B d_H(\boldsymbol{x}',\boldsymbol{y})}$, and where the real posterior
is obtained, of course, for $\beta = 1$. This is identified as a Boltzmann distribution whose energy
function (which depends on the given $\boldsymbol{y}$) is $\mathcal{E}(\boldsymbol{x}) = B d_H(\boldsymbol{x},\boldsymbol{y})$. As described in [5] and [26], there
are a few motivations for introducing the temperature parameter $\beta$ here. First, it allows a degree
of freedom in case there is some uncertainty regarding the channel noise level (small $\beta$ corresponds
to high noise level). Second, it is inspired by the ideas behind simulated annealing techniques:
by sampling from $P_\beta(\boldsymbol{x}|\boldsymbol{y})$ while gradually increasing $\beta$ (cooling the system), the minima of the energy
function (ground states) can be found. Third, by applying symbolwise MAP decoding, i.e., decoding
the i-th symbol of $\boldsymbol{x}$ as $\arg\max_a P_\beta(x_i = a|\boldsymbol{y})$, where

$$P_\beta(x_i = a|\boldsymbol{y}) = \sum_{\boldsymbol{x} \in \mathcal{C}:\ x_i = a} P_\beta(\boldsymbol{x}|\boldsymbol{y}),$$

we obtain a family of finite-temperature decoders parametrized by $\beta$, where $\beta = 1$ corresponds
to minimum symbol error probability (with respect to the true channel) and $\beta \to \infty$ corresponds
to minimum block error probability. As in [5], we will distinguish between two contributions of
$Z(\beta|\boldsymbol{y})$: One is $Z_c(\beta|\boldsymbol{y}) = e^{-\beta B d_H(\boldsymbol{x}_0,\boldsymbol{y})}$, where $\boldsymbol{x}_0$ is the actual codeword transmitted, and the
other is $Z_e(\beta|\boldsymbol{y}) = \sum_{\boldsymbol{x}' \in \mathcal{C}\setminus\{\boldsymbol{x}_0\}} e^{-\beta B d_H(\boldsymbol{x}',\boldsymbol{y})}$, pertaining to all incorrect codewords. The former is
typically about $e^{-\beta B n p}$ since $d_H(\boldsymbol{x}_0,\boldsymbol{y})$ concentrates about np. We next focus on the behavior of
$Z_e(\beta|\boldsymbol{y})$.

To this end, consider a random selection of the code $\mathcal{C}$, where every bit of every codeword is
drawn by an independent fair coin tossing. For a given $\boldsymbol{y}$, the energy levels $\{B d_H(\boldsymbol{x},\boldsymbol{y})\}$ pertaining
to all incorrect codewords are RV's (exactly like in the REM) because of the random selection of
these codewords. Now, the total number of incorrect codewords is about $e^{nR}$, and the probability
that a randomly chosen $\boldsymbol{x}$ would fall at distance $d = n\delta$ from $\boldsymbol{y}$ is exponentially $e^{n[h(\delta) - \ln 2]}$, where

$$h(\delta) = -\delta\ln\delta - (1-\delta)\ln(1-\delta),$$

so the typical number of codewords at normalized distance $\delta$ is about

$$N(\delta) \doteq e^{n[R + h(\delta) - \ln 2]}$$

as long as $R + h(\delta) - \ln 2 > 0$, and $N(\delta) = 0$ when $R + h(\delta) - \ln 2 < 0$. Thus, letting $\delta(R)$ denote
the small solution to the equation $R + h(\delta) - \ln 2 = 0$ (the Gilbert-Varshamov distance), we find
that, with a clear analogy to the REM, the corresponding thermodynamical entropy is given by

$$S(\delta) = \begin{cases} R + h(\delta) - \ln 2 & \delta(R) < \delta < 1 - \delta(R) \\ 0 & \delta = \delta(R) \text{ or } \delta = 1 - \delta(R) \\ -\infty & \delta < \delta(R) \text{ or } \delta > 1 - \delta(R) \end{cases} \tag{6}$$

Accordingly, the partition function $Z_e(\beta|\boldsymbol{y})$ of a typical code is given by

$$Z_e(\beta|\boldsymbol{y}) \doteq \sum_{\delta = \delta(R)}^{1-\delta(R)} e^{n[R + h(\delta) - \ln 2]} \cdot e^{-\beta B n\delta} \doteq \exp\left\{n\left[R - \ln 2 + \max_{\delta(R) \le \delta \le 1-\delta(R)} (h(\delta) - \beta B\delta)\right]\right\}, \tag{7}$$

and the free energy density pertaining to $Z_e$ behaves according to

$$F_e(\beta) = \begin{cases} \frac{\ln 2 - R - h(p_\beta)}{\beta} + B p_\beta & \beta < \beta_0 \\ B\delta(R) & \beta \ge \beta_0 \end{cases}$$

where

$$p_\beta = \frac{p^\beta}{p^\beta + (1-p)^\beta}$$

and

$$\beta_0 = \frac{\ln[(1 - \delta(R))/\delta(R)]}{B},$$

and where, again, the first line of $F_e(\beta)$ corresponds to the paramagnetic phase with exponentially
many codewords at distance (energy) $n p_\beta$ from $\boldsymbol{y}$, and the second line is the glassy phase with
subexponentially many codewords at distance $n\delta(R)$. In [26], these free energies are related to
random coding exponents as mentioned in the Introduction.
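
The quantities of this subsection are easy to evaluate numerically. The following Python sketch computes the Gilbert-Varshamov distance $\delta(R)$ by bisection and then the free energy density of the incorrect codewords, written as $-1/\beta$ times the exponent of $Z_e$ in (7). It assumes rates in nats with $0 < R < \ln 2$ and a crossover probability $p < 1/2$; the function names and the bisection tolerance are choices made here.

```python
import math

def h(d):
    """Binary entropy function in nats."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log(d) - (1 - d) * math.log(1 - d)

def gv_distance(R, tol=1e-12):
    """Smaller root of R + h(delta) - ln 2 = 0, found by bisection on [0, 1/2]."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if R + h(mid) - math.log(2) < 0:   # entropy still negative: root is to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def free_energy_e(beta, R, p):
    """Free energy density of Z_e for a BSC(p) and a random code of rate R."""
    B = math.log((1 - p) / p)
    d = gv_distance(R)
    beta0 = math.log((1 - d) / d) / B      # critical inverse temperature
    if beta < beta0:                       # paramagnetic phase: maximizer is p_beta
        p_beta = p ** beta / (p ** beta + (1 - p) ** beta)
        return (math.log(2) - R - h(p_beta)) / beta + B * p_beta
    return B * d                           # glassy phase: maximizer frozen at delta(R)
```

At $\beta = \beta_0$ the maximizer $p_\beta$ coincides with $\delta(R)$, so the two branches meet continuously, mirroring the REM transition at its own $\beta_0$.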

By the same token, in rate-distortion source coding, if one defines the partition function as

$$Z(\beta|\boldsymbol{x}) = \sum_{\boldsymbol{y} \in \mathcal{C}} e^{-\beta d_H(\boldsymbol{x},\boldsymbol{y})},$$

with $\boldsymbol{x}$ being the source vector, $\{\boldsymbol{y}\}$ being the reproduction codevectors, and $d_H(\boldsymbol{x},\boldsymbol{y})$ being the
Hamming distortion measure, then the same analysis takes place. In the sequel, we will motivate
this definition of the partition function of rate-distortion coding and use it.



3.4 The GREM

As we have seen, the REM is an extremely simple model to analyze, but its simplicity is also
recognized as a drawback from the aspect of faithfully modeling a spin glass. The reason for
this is the lack of structure which is needed to allow dependencies between energy levels of spin
configurations that are closely related: For example, if $\boldsymbol{x}$ and $\boldsymbol{x}'$ differ only in a single component,
it is conceivable that the respective energies would be close, as suggested by (2). To this end, as
described in the Introduction, Derrida proposed a generalized version of the REM, the GREM,
which introduces dependencies between configurational energies in a hierarchical fashion. We next
briefly review the GREM.

A GREM with k levels can best be thought of as a tree with $2^n$ leaves and depth k, where each
leaf represents one spin configuration. This tree is defined by k positive parameters, $\alpha_1, \ldots, \alpha_k$,
which are all in the interval (1, 2), and whose product, $\prod_{i=1}^{k} \alpha_i$, equals 2. The construction of this
tree is as follows: The root of the tree is connected to $\alpha_1^n$ distinct nodes,^8 which will be referred to
as first-level nodes. Each first-level node is in turn connected to $\alpha_2^n$ distinct second-level nodes,
thus a total of $(\alpha_1\alpha_2)^n$ second-level nodes. In the case k = 2, these second-level nodes are the
leaves of the tree and $\alpha_1\alpha_2 = 2$. If k > 2, the process continues, and each second-level node is
connected to $\alpha_3^n$ third-level nodes, and so on. At the last step, each one of the $\prod_{i=1}^{k-1} \alpha_i^n$ nodes
at level k - 1 is connected to $\alpha_k^n$ distinct leaves, thus a total of $\prod_{i=1}^{k} \alpha_i^n = 2^n$ leaves. The REM
corresponds to the degenerate special case where k = 1.

^8 We are approximating $\alpha_1^n, \alpha_2^n, \ldots, \alpha_k^n$ by integers.

The random selection of energy levels for the GREM is defined by another set of $k$ parameters,
$a_1, a_2, \ldots, a_k$, which are all positive reals that sum to unity. The random selection is carried out
in the following manner: For each one of the $\prod_{j=1}^{i}\alpha_j^n$ branches emanating from $(i-1)$-th level
nodes and connecting them to $i$-th level nodes ($i = 1, 2, \ldots, k$) in the tree, we randomly choose an
independent RV, henceforth referred to as a branch energy, which is a zero-mean Gaussian RV with
variance $nJ^2a_i/2$, where $J$ is as in the REM and where $\{a_i\}_{i=1}^{k}$ are as described above. Finally,
the energy level of a given configuration is given by the sum of branch energies along the path
from the root to the leaf that represents this configuration. Thus, the total energy is the sum of $k$
independent zero-mean Gaussian RVs with variances $\{nJ^2a_i/2\}_{i=1}^{k}$, and so, it is a zero-mean Gaussian
RV with variance $nJ^2/2$, exactly as in the REM. However, now the energy levels of different
configurations may be correlated if the paths from the root to their corresponding leaves
share some common branches before they split. The degree of statistical dependence is determined
by their distance along the tree. For example, if two configurations are first-degree siblings, i.e.,
they share the same parent node at level $k - 1$, then all their energy components are the same
except their last branch energies, which are independent. At the other extreme, if their paths are
completely distinct, then their energies are independent.
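The random construction above can be sketched in a few lines for $k = 2$. This is only an illustration under an added assumption: the branching numbers are taken as $\alpha_1^n = 2^{n_1}$ and $\alpha_2^n = 2^{n - n_1}$ so that they are integers and $\alpha_1\alpha_2 = 2$. The names `sample_grem` and `within_group_variance` are hypothetical helpers, not part of the paper.

```python
import math
import random


def sample_grem(n, n1, a1, J=1.0, seed=0):
    """Sample all 2**n leaf energies of a two-level GREM.

    Illustrative assumption: alpha_1**n = 2**n1 and alpha_2**n = 2**(n - n1),
    so the branching numbers are integers and alpha_1 * alpha_2 = 2.
    Returns one inner list of leaf energies per first-level node.
    """
    rng = random.Random(seed)
    sd1 = math.sqrt(n * J * J * a1 / 2.0)          # std of first-level branch energies
    sd2 = math.sqrt(n * J * J * (1.0 - a1) / 2.0)  # std of second-level branch energies
    tree = []
    for _ in range(2 ** n1):
        eps = rng.gauss(0.0, sd1)  # branch energy shared by all leaves below this node
        tree.append([eps + rng.gauss(0.0, sd2) for _ in range(2 ** (n - n1))])
    return tree


def within_group_variance(tree):
    """Average sample variance of the leaf energies under each first-level node."""
    total = 0.0
    for leaves in tree:
        mean = sum(leaves) / len(leaves)
        total += sum((e - mean) ** 2 for e in leaves) / (len(leaves) - 1)
    return total / len(tree)


if __name__ == "__main__":
    tree = sample_grem(n=12, n1=6, a1=0.5)
    # Siblings share the first-level energy, so the spread inside a group
    # reflects only the second-level variance n * J**2 * a2 / 2.
    print(len(tree), len(tree[0]), within_group_variance(tree))
```

Leaves under the same first-level node differ only in their last branch energy, which is why the within-group variance estimates $nJ^2a_2/2$ rather than the full $nJ^2/2$.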

The GREM for $k = 2$ is analyzed in [27]. We next present the derivation for this case (with a
few more details than in [27]). Let $\alpha_1$ and $\alpha_2$ be positive numbers whose product equals 2, and let
$a_1$ and $a_2$ be positive numbers whose sum equals 1. Now, every configuration with energy $E$ has
some first-level branch energy $\epsilon$ and second-level branch energy $E - \epsilon$. For a typical realization of
this GREM, the number of first-level branches with energy about $\epsilon$ is exponentially

$$N_1(\epsilon) \doteq \alpha_1^n \cdot \exp\left\{-\frac{\epsilon^2}{nJ^2a_1}\right\} = \exp\left\{n\left[\ln\alpha_1 - \frac{1}{a_1}\left(\frac{\epsilon}{nJ}\right)^2\right]\right\}$$

provided that the expression in the square brackets is non-negative, i.e., $|\epsilon| \le \epsilon_0 = nJ\sqrt{a_1\ln\alpha_1}$,
and $N_1(\epsilon) = 0$ otherwise. Therefore, the number of configurations with total energy about $E$ is
exponentially

$$N_2(E) \doteq \int_{-\epsilon_0}^{+\epsilon_0} d\epsilon \cdot N_1(\epsilon) \cdot \exp\left\{n\left[\ln\alpha_2 - \frac{1}{a_2}\left(\frac{E-\epsilon}{nJ}\right)^2\right]\right\}$$
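whose exponential rate (the entropy per spin) is given by

$$S(E) = \lim_{n\to\infty}\frac{\ln N_2(E)}{n} = \max_{|\epsilon|\le\epsilon_0}\left[\ln\alpha_1 - \frac{1}{a_1}\left(\frac{\epsilon}{nJ}\right)^2 + \ln\alpha_2 - \frac{1}{a_2}\left(\frac{E-\epsilon}{nJ}\right)^2\right].$$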



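Note that $S(E)$ is an even function, non-increasing in $|E|$, and it should be kept in mind that
beyond the value of $|E|$ at which $S(E)$ vanishes, denote it by $\hat{E}$, we have $S(E) = -\infty$ since $N_2(E)$
is typically zero (as was the case with the REM). We shall get back to this point shortly, but for
a moment, let us ignore it and solve the maximization problem pertaining to the above expression
of $S(E)$, as is. Denoting the resulting maximum by $\tilde{S}(E)$ (to distinguish it from $S(E)$, where $\hat{E}$ and
the jump to $-\infty$ are taken into account), we get:

$$\tilde{S}(E) = \begin{cases} \ln 2 - \left(\dfrac{E}{nJ}\right)^2 & |E| \le E_1 \\[2mm] \ln\alpha_2 - \dfrac{1}{a_2}\left(\dfrac{|E|}{nJ} - \sqrt{a_1\ln\alpha_1}\right)^2 & |E| > E_1 \end{cases} \tag{9}$$

where $E_1 = nJ\sqrt{(\ln\alpha_1)/a_1}$. (The unconstrained maximizer is $\epsilon = a_1E$, which satisfies $|\epsilon| \le \epsilon_0$ if and only if $|E| \le E_1$; otherwise the maximum is attained at $|\epsilon| = \epsilon_0$.) Taking now into account the above mentioned observation concerning
the criticality of the point $|E| = \hat{E}$, we have to distinguish between two cases. The first is the
case where $\hat{E} < E_1$, namely, the first line of the above expression of $\tilde{S}(E)$ vanishes for $|E|$ smaller
than $E_1$. The first line vanishes for $|E| = E_0 = nJ\sqrt{\ln 2}$, so the condition for this case to hold is
$E_0 < E_1$, or equivalently, $(\ln\alpha_1)/a_1 > \ln 2$. In this case, we then have: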

$$S(E) = \begin{cases} \ln 2 - \left(\dfrac{E}{nJ}\right)^2 & |E| < E_0 \\ 0 & |E| = E_0 \\ -\infty & |E| > E_0 \end{cases}$$

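which is exactly the same behavior as in the ordinary REM ($k = 1$). Consequently, the exponential
rate of the partition function, which is given by

$$\phi(\beta) = \lim_{n\to\infty}\frac{\ln Z(\beta)}{n} = \max_{E}\left[S(E) - \beta\cdot\frac{E}{n}\right],$$

is also the same as in the REM, namely,

$$\phi(\beta) = \begin{cases} \ln 2 + \dfrac{\beta^2J^2}{4} & \beta < \beta_0 \\[2mm] \beta J\sqrt{\ln 2} & \beta \ge \beta_0 \end{cases}$$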



where $\beta_0$ is the above defined critical inverse temperature of the REM (see Subsection 3.2).

We next consider the complementary case where $(\ln\alpha_1)/a_1 < \ln 2$. In this case, the expression
of $S(E)$ should take into account the fact that it vanishes (and then becomes $-\infty$) according to
the second line of (9). This amounts to:

$$S(E) = \begin{cases} \ln 2 - \left(\dfrac{E}{nJ}\right)^2 & |E| \le E_1 \\[2mm] \ln\alpha_2 - \dfrac{1}{a_2}\left(\dfrac{|E|}{nJ} - \sqrt{a_1\ln\alpha_1}\right)^2 & E_1 < |E| < E_2 \\[2mm] 0 & |E| = E_2 \\ -\infty & |E| > E_2 \end{cases} \tag{10}$$

where $E_2 = nJ(\sqrt{a_1\ln\alpha_1} + \sqrt{a_2\ln\alpha_2})$. Before we compute the corresponding partition function,
we make the following observation:



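$$\ln 2 = \frac{\ln\alpha_1 + \ln\alpha_2}{a_1 + a_2} \le \max_{i=1,2}\frac{\ln\alpha_i}{a_i}$$

where the inequality follows from the well-known inequality $[\sum_{i=1}^{m} a_i]/[\sum_{i=1}^{m} b_i] \le \max_{1\le i\le m} a_i/b_i$
for positive $\{a_i\}$ and $\{b_i\}$ [36, Lemma 1]. In the same manner, using the similar inequality
$[\sum_{i=1}^{m} a_i]/[\sum_{i=1}^{m} b_i] \ge \min_{1\le i\le m} a_i/b_i$, we get

$$\ln 2 \ge \min_{i=1,2}\frac{\ln\alpha_i}{a_i}.$$

It follows then that the condition $(\ln\alpha_1)/a_1 < \ln 2$ is equivalent to the condition $(\ln\alpha_1)/a_1 < \ln 2 <
(\ln\alpha_2)/a_2$. Defining

$$\beta_i = \frac{2}{J}\sqrt{\frac{\ln\alpha_i}{a_i}}, \quad i = 1, 2,$$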

we then have $\beta_1 < \beta_0 < \beta_2$. Let us examine how $\phi(\beta)$ behaves as $\beta$ grows from zero to infinity. For
small enough $\beta$, the achiever of $\phi(\beta)$, call it $E^*$, is still smaller in absolute value than $E_1$, and then
it is obtained from equating to zero the derivative of $[S(E) - \beta E/n]$, with $S(E)$ being according to
the first line of (10); thus $E^* = -n\beta J^2/2$. This remains true as long as $n\beta J^2/2 < E_1$, which means $\beta < \beta_1$.
In this case, the partition function is dominated by $\exp\{n[\ln\alpha_1 - a_1\beta^2J^2/4]\}$ first-level branches
with energy $\epsilon^* = -n\beta a_1J^2/2$, each followed by $\exp\{n[\ln\alpha_2 - a_2\beta^2J^2/4]\}$ second-level branches with
energy $E^* - \epsilon^* = -n\beta a_2J^2/2$, and this is a pure paramagnetic phase. As $\beta$ continues to grow beyond
$\beta_1$, but is still below $\beta_2$, the partition function is dominated by a subexponential number of first-level
branches of energy $-nJ\sqrt{a_1\ln\alpha_1}$ followed by $\exp\{n[\ln\alpha_2 - a_2\beta^2J^2/4]\}$ second-level branches
with energy $E^* + nJ\sqrt{a_1\ln\alpha_1}$. This is a "semi-glassy" phase, where the first-level branches are
already glassy but the second-level ones are still paramagnetic. As $\beta$ exceeds $\beta_2$, this becomes a
pure glassy phase where the partition function is dominated by a subexponential number of first-level
branches with energy $-nJ\sqrt{a_1\ln\alpha_1}$ and a subexponential number of second-level branches
with energy $-nJ\sqrt{a_2\ln\alpha_2}$. Accordingly, the function $\phi(\beta)$ exhibits two phase transitions at inverse
temperatures $\beta_1$ and $\beta_2$:



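$$\phi(\beta) = \begin{cases} \ln 2 + \dfrac{\beta^2J^2}{4} & \beta < \beta_1 \\[2mm] \beta J\sqrt{a_1\ln\alpha_1} + \ln\alpha_2 + \dfrac{a_2\beta^2J^2}{4} & \beta_1 \le \beta < \beta_2 \\[2mm] \beta J\left(\sqrt{a_1\ln\alpha_1} + \sqrt{a_2\ln\alpha_2}\right) & \beta \ge \beta_2 \end{cases}$$

Again, the free energy density is obtained by $F(\beta) = -\phi(\beta)/\beta$.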
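As a sanity check on the two regimes, the following sketch evaluates $\phi(\beta)$ for $k = 2$ in two ways: by the piecewise expressions above, and by a brute-force maximization of $S(E) - \beta E/n$ over a grid of normalized energies $e = E/n$. The function names and the grid resolution are illustrative assumptions, and the parameters are assumed to satisfy $1 < \alpha_1 < 2$ and $0 < a_1 < 1$.

```python
import math

LN2 = math.log(2.0)


def grem_entropy(e, alpha1, a1, J=1.0):
    """Entropy per spin S at normalized energy e = E/n for the two-level GREM."""
    alpha2, a2 = 2.0 / alpha1, 1.0 - a1
    x = abs(e) / J
    s = LN2 - x * x                                # first line (REM-like branch)
    e1 = math.sqrt(math.log(alpha1) / a1)          # E_1 / (nJ)
    if math.log(alpha1) / a1 < LN2 and x > e1:
        g1 = math.sqrt(a1 * math.log(alpha1))      # first level is frozen
        s = math.log(alpha2) - (x - g1) ** 2 / a2
    return s if s >= 0.0 else -math.inf


def grem_phi(beta, alpha1, a1, J=1.0):
    """Piecewise closed form of phi(beta), as given in the text."""
    alpha2, a2 = 2.0 / alpha1, 1.0 - a1
    r1, r2 = math.log(alpha1) / a1, math.log(alpha2) / a2
    if r1 >= LN2:                                  # behaves like the REM
        beta0 = 2.0 * math.sqrt(LN2) / J
        if beta < beta0:
            return LN2 + beta ** 2 * J ** 2 / 4.0
        return beta * J * math.sqrt(LN2)
    beta1, beta2 = 2.0 * math.sqrt(r1) / J, 2.0 * math.sqrt(r2) / J
    g1 = math.sqrt(a1 * math.log(alpha1))
    g2 = math.sqrt(a2 * math.log(alpha2))
    if beta < beta1:                               # paramagnetic
        return LN2 + beta ** 2 * J ** 2 / 4.0
    if beta < beta2:                               # semi-glassy
        return beta * J * g1 + math.log(alpha2) + a2 * beta ** 2 * J ** 2 / 4.0
    return beta * J * (g1 + g2)                    # glassy


def grem_phi_numeric(beta, alpha1, a1, J=1.0, grid=100000):
    """Brute-force maximum of S(e) - beta * e over e in [-J, 0]."""
    best = -math.inf
    for i in range(grid + 1):
        e = -J * i / grid
        best = max(best, grem_entropy(e, alpha1, a1, J) - beta * e)
    return best


if __name__ == "__main__":
    for alpha1 in (1.2, 1.8):   # two transitions vs. REM-like behavior (a1 = 0.5)
        for beta in (0.5, 1.5, 3.0):
            print(alpha1, beta, grem_phi(beta, alpha1, 0.5),
                  grem_phi_numeric(beta, alpha1, 0.5))
```

With $a_1 = 0.5$, the choice $\alpha_1 = 1.2$ falls in the case $(\ln\alpha_1)/a_1 < \ln 2$ (two phase transitions), while $\alpha_1 = 1.8$ falls in the REM-like case.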

This different behavior of the GREM for the two different cases will be pivotal to our later 
discussion on the parallel behavior of ensembles of codes. When there is a general number k of 
levels, the above analysis of the GREM becomes, of course, more complicated and there are more 
cases to consider, but the concepts remain the same. There can be up to $k$ phase transitions, but
there can be fewer, depending on the parameters of the model $\{\alpha_i, a_i\}_{i=1}^{k}$. For details, the reader is
referred to [28], [29].

4 Relations Between GREM and Hierarchical Code Ensembles 

In analogy to the relationship between the REM and ordinary ensembles of block codes, as was
described in Subsection 3.3, it is natural to wonder about the possibility of similar relationships
between the GREM and more general ensembles of block codes, and to ask whether the fact that
the GREM exhibits different types of behavior (as we have seen in Subsection 3.4) has implications
on the behavior of these ensembles of codes. Since the GREM is defined by an hierarchical (tree)
structure, it is plausible to expect that if a relationship to coding exists, it will be in the context of 
ensembles of codes which have hierarchical structures as well. Hierarchically structured ensembles 
of codes are encountered in numerous applications in Information Theory, including block-causal 
tree-structured source codes and channel codes of the type described informally in the Introduction, 
successive refinement source codes [37] , [38] , [39] , codes for the broadcast channel (iQl Chap. 15.6] 
and codes based on binning techniques (see, e.g., |1I] , [12] , [IS] ) , just to name a few. In this paper, 
we confine our attention to the first above-mentioned class of codes. 

The fact that the GREM behaves, in some situations, like the REM, and the REM is analogous
to an ordinary block code without any hierarchical structure (cf. Subsection 3.3), may hint that in the parallel
situations in the realm of our coding problem, a typical code from the hierarchical ensemble will
perform essentially as well as a typical (good) code without the hierarchical structure. In these
situations then (which can be imposed by a clever choice of certain design parameters), it would be
interesting to explore the question whether we may enjoy the benefit that the hierarchical structure 
buys us (in our case, reduced delay) without essentially paying in terms of performance. As we 
show in this section, the answer to this question turns out to be affirmative to a large extent, both 
in the source coding setting and in the channel coding setting. 

Finally, in closing this introductory part of Section 4, a more technical comment is in order: As
in Subsection 3.3, throughout the sequel, we confine ourselves to the memoryless binary symmetric
source (BSS) with the Hamming distortion measure, in the context of source coding, and to the
binary symmetric channel (BSC) in the context of channel coding. The random coding distribution
in both problems will be i.i.d. and uniform, i.e., each bit of each codeword will be drawn by
independent fair coin tossing. Also, we will focus mostly on the case $k = 2$. The reason for this is
that our purpose in this paper is more to demonstrate certain concepts, and so, we prefer to slightly
sacrifice generality for the benefit of simplicity, and hence better readability and a smaller amount of
space. Having said that, all the derivations can be extended to apply to more general memoryless
sources, channels, and random coding distributions (as was done in [26]), as well as to a general
number $k$ of stages.

4.1 Lossy Source Coding 

Consider the BSS $X_1, X_2, \ldots$, $X_i \in \{0, 1\}$ ($i$ a positive integer), and the Hamming distortion measure
between two binary $n$-vectors $x$ and $\hat{x}$:

$$d_H(x, \hat{x}) = \sum_{i=1}^{n} d_H(x_i, \hat{x}_i)$$

where $d_H(a, b) = 1$ if $a \ne b$ and $d_H(a, b) = 0$ if $a = b$, $a, b \in \{0, 1\}$. Before discussing ensembles
of codes with hierarchical structures, let us first confine attention to an ordinary ensemble with no
structure.

Consider a random selection of a codebook of size $M = e^{nR}$ ($R$ being the coding rate in nats
per source bit), $\mathcal{C} = \{\hat{x}_1, \ldots, \hat{x}_M\}$, $\hat{x}_i \in \{0, 1\}^n$, $i = 1, 2, \ldots, M$, where each component of each
codeword is drawn randomly by an independent fair coin tossing. For a given source vector $x$ and
for a given such randomly drawn codebook $\mathcal{C}$, let $\Delta(x) = \min_{\hat{x}\in\mathcal{C}} d_H(x, \hat{x})$ denote the distortion
associated with encoding $x$.
Instead of examining the expected distortion, $E\{\Delta(X)\}$, w.r.t. both the source and the random
codebook selection, as is traditionally done, we will concern ourselves with a more refined and more
informative objective function, which is the characteristic function of $\Delta(X)$, namely,

$$\Psi_n(s, R) = E\{\exp[-s\Delta(X)]\},$$

or in particular, its exponential rate

$$\psi(s, R) = -\lim_{n\to\infty}\frac{\ln\Psi_n(s, R)}{n},$$

focusing on the range $s \ge 0$. As is well known, the characteristic function provides information
not only on the expected distortion, $E\{\Delta(X)\}$, but also on every moment of $\Delta(X)$ (by taking
derivatives of $\Psi_n(s, R)$ at $s = 0$). It is also intimately related to the tail behavior (i.e., large
deviations probabilities) of the distribution of $\Delta(X)$ via Chernoff bounds.

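In order to analyze $\Psi_n(s, R)$ and then $\psi(s, R)$, first, for an ordinary ensemble, and later for
an hierarchical structured ensemble, it is convenient to define, for given $x$ and $\mathcal{C}$, the partition
function$^9$

$$Z(\beta|x) = \sum_{\hat{x}\in\mathcal{C}} e^{-\beta d_H(x, \hat{x})}. \tag{11}$$

The function $\Psi_n(s, R)$ is obtained from the partition function by

$$\Psi_n(s, R) = E\left\{\lim_{\theta\to\infty} Z^{1/\theta}(s\cdot\theta|X)\right\} = \lim_{\theta\to\infty} E\left\{Z^{1/\theta}(s\cdot\theta|X)\right\}.$$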

In the definition of the ensemble behavior of $\psi(s, R)$, there are now two options. The first is to think
of the above defined expectation of $Z^{1/\theta}(s\theta|X)$ as being taken w.r.t. both the source $X$ and the
code ensemble $\{\mathcal{C}\}$, and then to define $\psi(s, R)$ as above. The second option is to define the above
expectation of $Z^{1/\theta}(s\theta|X)$ w.r.t. the source only, while keeping $\mathcal{C}$ fixed, and then to define $\psi(s, R)$
as $-\lim_{n\to\infty} E\{\ln\Psi_n(s, R)\}/n$, where the latter expectation is across the ensemble of codebooks
$\{\mathcal{C}\}$. The difference between the two approaches is in the point of view: In the former
approach, the randomness of $X$ and that of $\mathcal{C}$ are treated on equal grounds, and this makes sense
if $X$ and $\mathcal{C}$ vary on the same time scale (e.g., when the codebook varies frequently according to
some secret key). In the parallel discussion on spin glasses (cf. Section 3.1), this is analogous to

$^9$For a given $x$, the partition function $Z(\beta|x)$ induced by a typical codebook is exactly the same as in (7), with
the minor modification that here $\beta$ is not scaled by $B$ as in (7).
the double randomness of both the spin configuration and the interaction parameters, and in the
language of statistical physicists, this is called annealed averaging. The second approach, which
physicists refer to as quenched averaging, fits better the paradigm where the code $\mathcal{C}$ is held fixed
over many realizations of the source $X$. In the Information Theory literature, it is more customary
to adopt an approach analogous to annealed averaging,$^{10}$ and so, we shall do the same here.



4.1.1 The Ordinary Ensemble 



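Let us begin with the calculation of the annealed version of $\psi(s, R)$, first, for an ordinary
non-hierarchical code:

$$\begin{aligned} E\{Z^{1/\theta}(s\theta|X)\} &= E\left\{\left[\sum_{\hat{x}\in\mathcal{C}} \exp(-s\theta d_H(X, \hat{x}))\right]^{1/\theta}\right\} \\ &= E\left\{\left[\sum_{d=0}^{n} N(d)\cdot e^{-s\theta d}\right]^{1/\theta}\right\} \\ &\doteq E\left\{\sum_{d=0}^{n} N^{1/\theta}(d)\cdot e^{-sd}\right\} \\ &= \sum_{d=0}^{n} E\{N^{1/\theta}(d)\}\cdot e^{-sd} \end{aligned} \tag{12}$$

where $N(n\delta)$ is the number of codewords whose normalized Hamming distance from $X$ is exactly $\delta$,
and where the third (exponential) equality holds, even before taking the expectation, because the
summation over $d$ consists of a subexponential number of terms, and so, both $[\sum_d N(d)e^{-s\theta d}]^{1/\theta}$
and $\sum_d N^{1/\theta}(d)e^{-sd}$ are of the same exponential order as $\max_d N^{1/\theta}(d)e^{-sd} = [\max_d N(d)e^{-s\theta d}]^{1/\theta}$.
This is different from the original summation over $\mathcal{C}$, which contains an exponential number of terms.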
Now, as is shown in Subsection A.l of the Appendix (see also 



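$$E\{N^{1/\theta}(n\delta)\} \doteq \begin{cases} e^{n[R+h(\delta)-\ln 2]} & \delta \le \delta(R) \ \text{or} \ \delta \ge 1 - \delta(R) \\ e^{n[R+h(\delta)-\ln 2]/\theta} & \delta(R) < \delta < 1 - \delta(R) \end{cases} \tag{13}$$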



$^{10}$In particular, source and channel random coding exponents are normally defined as exponential rates of ensemble-average
error probabilities, and not as ensemble-average exponents of error probabilities.
where $\delta(R)$ is defined (cf. Subsection 3.3) as the smaller solution to the equation $R + h(\delta) - \ln 2 = 0$,
which is also the distortion-rate function of the BSS. This gives

$$E\{Z^{1/\theta}(s\theta|X)\} \doteq \sum_{\delta\le\delta(R)} e^{n[R+h(\delta)-\ln 2]}\cdot e^{-s\delta n} + \sum_{\delta>\delta(R)} e^{n[R+h(\delta)-\ln 2]/\theta}\cdot e^{-s\delta n} = A + B. \tag{14}$$

Now, as $\theta \to \infty$, the term $B$ tends to $\sum_{\delta>\delta(R)} e^{-s\delta n}$, which is of the exponential order of $e^{-ns\delta(R)}$.
The term $A$, which is independent of $\theta$, is of the exponential order of $e^{-nu(s,R)}$, where

$$u(s, R) = \ln 2 - R - \max_{\delta\le\delta(R)}[h(\delta) - s\delta] = \begin{cases} s\delta(R) & s \le s_R \\ v(s, R) & s > s_R \end{cases}$$

where

$$s_R = \ln\frac{1 - \delta(R)}{\delta(R)}$$

and

$$v(s, R) = \ln 2 - R + s - \ln(1 + e^s).$$

Since $v(s, R)$ never exceeds $s\delta(R)$ for $s > s_R$, the dominant term is $A$, and therefore, for the
ordinary block code ensemble, we have:

$$\psi(s, R) = u(s, R).$$

It is not difficult to show also, using sphere covering considerations, that $u(s, R)$ is the best achievable
performance in terms of the exponential rate of the characteristic function of the distortion.
The function $u(s, R)$ is depicted qualitatively in Fig. 1.
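The quantities $\delta(R)$, $s_R$ and $u(s, R)$ are easy to evaluate numerically. The sketch below, with hypothetical helper names and rates in nats assumed to satisfy $0 < R < \ln 2$, finds $\delta(R)$ by bisection and evaluates $u(s, R)$ from the piecewise formula; comparing it against a direct grid maximization of $h(\delta) - s\delta$ over $\delta \le \delta(R)$ is a quick consistency check.

```python
import math

LN2 = math.log(2.0)


def h(d):
    """Binary entropy in nats."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def delta_of_R(R):
    """Smaller root of R + h(delta) - ln 2 = 0, by bisection on [0, 1/2]."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if R + h(mid) < LN2:
            lo = mid      # root lies above mid
        else:
            hi = mid      # root lies at or below mid
    return hi


def v(s, R):
    """The curved branch v(s, R) = ln 2 - R + s - ln(1 + e^s)."""
    return LN2 - R + s - math.log(1.0 + math.exp(s))


def u(s, R):
    """Exponential rate u(s, R) of the characteristic function (ordinary ensemble)."""
    d = delta_of_R(R)
    s_R = math.log((1.0 - d) / d)
    return s * d if s <= s_R else v(s, R)


if __name__ == "__main__":
    R = 0.3
    print("delta(R) =", delta_of_R(R))
    for s in (0.5, 1.0, 2.0, 4.0):
        print(s, u(s, R))
```

For small $s$ the function is linear with slope $\delta(R)$, in line with the fact that the derivative at $s = 0$ gives the normalized expected distortion.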

4.1.2 The Hierarchical Ensemble 

We proceed to define the ensemble of hierarchical codes and to analyze its performance with relation
to the GREM. Let $n = n_1 + n_2$, where $n$, $n_1$ and $n_2$ are positive integers. For a given $R_1$,
consider a random selection of a codebook of size $M_1 = e^{n_1R_1}$, $\mathcal{C}_1 = \{\hat{x}_1, \ldots, \hat{x}_{M_1}\}$, $\hat{x}_i \in \{0, 1\}^{n_1}$,
$i = 1, 2, \ldots, M_1$, where each component of each codeword is drawn randomly by an independent
fair coin tossing. Next, given $R_2$, for each $i = 1, 2, \ldots, M_1$, consider a similar random selection of
a codebook of size $M_2 = e^{n_2R_2}$, $\mathcal{C}_2(i) = \{\tilde{x}_{i,1}, \ldots, \tilde{x}_{i,M_2}\}$, $\tilde{x}_{i,j} \in \{0, 1\}^{n_2}$, $j = 1, 2, \ldots, M_2$.
Figure 1: $u(s, R)$ as a function of $s$ for fixed $R$ (label recovered from the plot: $\ln 2 - R$).

The encoder works as follows: Given a source vector $x \in \{0, 1\}^n$, it finds a pair of indices
$i \in \{1, 2, \ldots, M_1\}$, $j \in \{1, 2, \ldots, M_2\}$, such that the distortion between $x$ and the concatenation of the
codewords $(\hat{x}_i, \tilde{x}_{i,j})$ is minimum. The index $i$ is encoded by $n_1R_1$ nats and the index $j$ (given $i$) is
encoded by $n_2R_2$ nats, thus a total of $nR = n_1R_1 + n_2R_2$ nats, where $R$ is the overall rate, given
by

$$R = \lambda R_1 + (1 - \lambda)R_2, \quad \lambda = \frac{n_1}{n}.$$

The decoder can, of course, generate the first-stage reproduction $\hat{x}_i$ based on the first $n_1R_1$ nats
received, without having to wait for the $n_2R_2$ following ones. The extension of this hierarchical
structure to a larger number of stages $k$ should be obvious. In particular, as mentioned in the
Introduction, if $k$ divides $n$ and the $n$-block is divided into $k$ sub-blocks of length $n/k$ each, then
the decoder can generate chunks of the reproduction at a reduced delay of $n/k$ instead of $n$.

The analogy of this structure with the GREM should also be obvious. The code has a tree
structure and the configurational energies of the GREM play the same role as the distortion here,
as the overall distortion is the cumulative sum of the per-stage distortions. Also, the coding rate
$R_i$ here plays the same role as $\ln\alpha_i$ of the GREM ($i = 1, 2$). Thus, it is natural to expect that the
partition function $Z(\beta|x)$ of this code ensemble would behave analogously to that of the GREM,
as we shall see next.
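To make the two-stage structure concrete, here is a toy sketch of the codebook drawing, the exhaustive encoder, and the decoder. The tiny sizes and the function names (`draw_codebooks`, `encode`, `decode`) are illustrative assumptions; in the ensemble of the text, $M_1 = e^{n_1R_1}$ and $M_2 = e^{n_2R_2}$ grow exponentially with the block length.

```python
import random


def hamming(a, b):
    """Hamming distance between two equal-length binary sequences."""
    return sum(x != y for x, y in zip(a, b))


def draw_codebooks(n1, n2, M1, M2, rng):
    """Draw a first-stage codebook and, for each of its codewords,
    an independent second-stage codebook (fair coin tossing throughout)."""
    first = [[rng.randrange(2) for _ in range(n1)] for _ in range(M1)]
    second = [[[rng.randrange(2) for _ in range(n2)] for _ in range(M2)]
              for _ in range(M1)]
    return first, second


def encode(x, first, second):
    """Exhaustive search for the pair (i, j) minimizing the total distortion."""
    n1 = len(first[0])
    best = None
    for i, c1 in enumerate(first):
        d1 = hamming(x[:n1], c1)
        for j, c2 in enumerate(second[i]):
            d = d1 + hamming(x[n1:], c2)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best  # (distortion, i, j)


def decode(i, j, first, second):
    """Concatenate the first-stage codeword i with second-stage codeword j of C2(i)."""
    return first[i] + second[i][j]


if __name__ == "__main__":
    rng = random.Random(0)
    first, second = draw_codebooks(n1=6, n2=6, M1=8, M2=4, rng=rng)
    x = [rng.randrange(2) for _ in range(12)]
    d, i, j = encode(x, first, second)
    print(d, i, j, decode(i, j, first, second))
```

Note that the first $n_1$ reproduction symbols depend on $i$ only, which is what allows the decoder to output them before the second index arrives.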



For the sake of simplicity, we return to the case $k = 2$, with the understanding that our
derivations can be extended without any essential difficulties to a general $k$. Before analyzing the
characteristic function of the distortion along with its exponential rate, it is instructive to examine
the partition function $Z(\beta|x)$ for a given $x$ and address the analogy with that of the GREM.

For a given $x$ and a typical code in the ensemble, there are $N_1(\delta_1) \doteq e^{n_1[R_1+h(\delta_1)-\ln 2]}$ first-stage
codewords $\{\hat{x}\}$ at distance $n_1\delta_1$ from the vector formed by the first $n_1$ components of $x$, provided
that $\delta_1 \ge \delta(R_1)$, and $N_1(\delta_1) = 0$ otherwise. For each one of these first-stage codewords, there are
$e^{n_2[R_2+h(\delta_2)-\ln 2]}$ second-stage codewords $\{\tilde{x}\}$ at distance $n_2\delta_2$ from the vector formed by the last
$n_2$ components of $x$, provided that $\delta_2 \ge \delta(R_2)$. Thus, the total number of concatenated codewords
$\{(\hat{x}, \tilde{x})\}$ at distance $n\delta = n_1\delta_1 + n_2\delta_2$ (that is, $\delta = \lambda\delta_1 + (1 - \lambda)\delta_2$) from $x$ is given by

$$\begin{aligned} N_2(\delta) &\doteq \sum_{\delta_1=\delta(R_1)}^{1-\delta(R_1)} e^{n_1[R_1+h(\delta_1)-\ln 2]}\cdot e^{n_2[R_2+h((\delta-\lambda\delta_1)/(1-\lambda))-\ln 2]} \\ &\doteq \exp\left\{n\cdot\max_{\delta(R_1)\le\delta_1\le 1-\delta(R_1)}\left[R + \lambda h(\delta_1) + (1-\lambda)h\left(\frac{\delta-\lambda\delta_1}{1-\lambda}\right) - \ln 2\right]\right\}. \end{aligned} \tag{15}$$

Consequently, the exponential growth rate of $N_2(\delta)$ is given by

$$S(\delta) = \max_{\delta(R_1)\le\delta_1\le 1-\delta(R_1)}\left[R + \lambda h(\delta_1) + (1-\lambda)h\left(\frac{\delta-\lambda\delta_1}{1-\lambda}\right) - \ln 2\right].$$

For large $\delta$, the constraint $\delta(R_1) \le \delta_1 \le 1 - \delta(R_1)$ is inactive and the achiever of $S(\delta)$ is $\delta_1 = \delta$,
and then

$$S(\delta) = R + \lambda h(\delta) + (1 - \lambda)h(\delta) - \ln 2 = R + h(\delta) - \ln 2.$$

If we now gradually reduce $\delta$, the behavior depends on whether we first encounter the value $\delta =
\delta(R_1)$, below which $\delta_1 = \delta$ no longer satisfies the constraint, or the value $\delta = \delta(R)$, below which
$S(\delta) = R + h(\delta) - \ln 2$ vanishes. This in turn depends on whether $\delta(R_1)$ is larger or smaller than
$\delta(R)$, or equivalently, whether $R_1 < R < R_2$ or $R_1 > R > R_2$.

Consider the case $R_1 > R > R_2$ first. In this case, $\delta(R_1) < \delta(R) < \delta(R_2)$, and we have:



$$S(\delta) = \begin{cases} R + h(\delta) - \ln 2 & \delta(R) < \delta < 1 - \delta(R) \\ 0 & \delta = \delta(R) \ \text{or} \ \delta = 1 - \delta(R) \\ -\infty & \delta < \delta(R) \ \text{or} \ \delta > 1 - \delta(R) \end{cases} \tag{16}$$

exactly like in the ordinary, non-hierarchical ensemble (cf. eq. (6)), and then the corresponding
exponential rate of the partition function is as in Subsection 3.3, except that here $\beta$ is not scaled
by $B$, i.e., $\phi(\beta) = -u(\beta, R)$.

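The other case is $R_1 < R < R_2$, which is equivalent to $\delta(R_1) > \delta(R) > \delta(R_2)$. Here, in analogy
to the GREM with two phase transitions, we have:

$$\phi(\beta) = \begin{cases} -v(\beta, R) & \beta < \beta(R_1) \\ -\lambda\beta\delta(R_1) - (1 - \lambda)v(\beta, R_2) & \beta(R_1) \le \beta < \beta(R_2) \\ -\beta[\lambda\delta(R_1) + (1 - \lambda)\delta(R_2)] & \beta \ge \beta(R_2) \end{cases}$$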



We now identify the first line as the purely paramagnetic phase, the second line as the "semi-glassy"
phase (where $\{\hat{x}\}$ are glassy but $\{\tilde{x}\}$ are paramagnetic), and the third line as the purely
glassy phase. Note that the glassy phase here behaves as if the two parts of the code, at rates $R_1$
and $R_2$, were operating independently, namely, as if $\{\mathcal{C}_2(i)\}_{i=1}^{M_1}$ were all identical, in which case, the
distortion would have been minimized separately over the two segments. We will get back to this
point in the sequel.
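The three phases for $R_1 < R_2$ can be checked numerically. The sketch below maximizes the exponent of (15) minus $\beta\delta$ over a grid of $(\delta_1, \delta_2)$ and compares the result with the three-line expression. It assumes that $\beta(R)$ denotes $\ln[(1 - \delta(R))/\delta(R)]$, restricts the search to distances at most $1/2$ (larger distances can only lower the objective for $\beta > 0$), and uses hypothetical function names and an arbitrary grid size.

```python
import math

LN2 = math.log(2.0)


def h(d):
    """Binary entropy in nats."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def delta_of_R(R):
    """Smaller root of R + h(delta) = ln 2, by bisection (0 < R < ln 2)."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if R + h(mid) < LN2:
            lo = mid
        else:
            hi = mid
    return hi


def v(s, R):
    return LN2 - R + s - math.log(1.0 + math.exp(s))


def beta_of_R(R):
    """Assumed meaning of beta(R): the slope of h at delta(R)."""
    d = delta_of_R(R)
    return math.log((1.0 - d) / d)


def phi_closed(beta, R1, R2, lam):
    """Three-phase expression for R1 < R2."""
    R = lam * R1 + (1.0 - lam) * R2
    if beta < beta_of_R(R1):
        return -v(beta, R)                                   # paramagnetic
    if beta < beta_of_R(R2):
        return -lam * beta * delta_of_R(R1) - (1.0 - lam) * v(beta, R2)  # semi-glassy
    return -beta * (lam * delta_of_R(R1) + (1.0 - lam) * delta_of_R(R2))  # glassy


def phi_numeric(beta, R1, R2, lam, grid=400):
    """Grid maximum of [exponent of (15)] - beta * delta, with delta1 >= delta(R1)
    and only pairs whose total exponent is non-negative."""
    d1 = delta_of_R(R1)
    xs1 = [d1 + (0.5 - d1) * a / grid for a in range(grid + 1)]
    xs2 = [0.5 * b / grid for b in range(grid + 1)]
    w2 = [(1.0 - lam) * (R2 + h(x) - LN2) for x in xs2]
    best = -math.inf
    for x1 in xs1:
        w1 = lam * (R1 + h(x1) - LN2)
        for x2, w2x in zip(xs2, w2):
            w = w1 + w2x
            if w >= 0.0:
                best = max(best, w - beta * (lam * x1 + (1.0 - lam) * x2))
    return best


if __name__ == "__main__":
    for beta in (1.0, 2.0, 3.0):
        print(beta, phi_closed(beta, 0.2, 0.4, 0.5), phi_numeric(beta, 0.2, 0.4, 0.5))
```

For the illustrative rates $R_1 = 0.2$, $R_2 = 0.4$ and $\lambda = 1/2$, the three values of $\beta$ fall in the paramagnetic, semi-glassy and glassy phases, respectively.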

We have seen then that the ensemble behaves substantially differently depending on whether
$R_1 > R_2$ or $R_1 < R_2$. In the former case, the above calculation may indicate that the ensemble
performance is similar to that of an ordinary block code of length $n$ without any structure. We
next carry out a detailed analysis of the characteristic function and its exponential rate, which we
shall denote by $\psi(s, R_1, R_2)$.

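Similarly as before, we first compute $E\{Z^{1/\theta}(s\theta|X)\}$:

$$\begin{aligned} E\{Z^{1/\theta}(s\theta|X)\} &= E\left\{\left[\sum_{d_1=0}^{n_1}\sum_{d_2=0}^{n_2} N(d_1, d_2)\cdot e^{-s\theta(d_1+d_2)}\right]^{1/\theta}\right\} \\ &\doteq \sum_{d_1=0}^{n_1}\sum_{d_2=0}^{n_2} E\{N^{1/\theta}(d_1, d_2)\}\cdot e^{-s(d_1+d_2)} \end{aligned} \tag{17}$$

where $N(d_1, d_2)$ is the number of concatenated codewords $\{(\hat{x}, \tilde{x})\}$ for which the first stage contributes
distance $d_1$ and the second stage contributes distance $d_2$. For the moments $E\{N^{1/\theta}(d_1, d_2)\}$, or
equivalently, $E\{N^{1/\theta}(n_1\delta_1, n_2\delta_2)\}$, the following is proven in Section A.2 of the Appendix: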



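$$E\{N^{1/\theta}(n_1\delta_1, n_2\delta_2)\} \doteq \begin{cases} \exp\{n[\lambda W_1 + (1 - \lambda)W_2]\} & \delta_1 \in \mathcal{I}^c(R_1), \ \delta_2 \in \mathcal{I}^c(R_2) \\ \exp\{n[\lambda W_1 + (1 - \lambda)W_2/\theta]\} & \delta_1 \in \mathcal{I}^c(R_1), \ \delta_2 \in \mathcal{I}(R_2) \\ \exp\{n[\lambda W_1 + (1 - \lambda)W_2]/\theta\} & \delta_1 \in \mathcal{I}(R_1), \ \delta_2 \in \mathcal{I}(R_2) \\ \exp\{n\eta[\lambda W_1 + (1 - \lambda)W_2]\} & \delta_1 \in \mathcal{I}(R_1), \ \delta_2 \in \mathcal{I}^c(R_2) \end{cases} \tag{18}$$

where $\mathcal{I}(R) = (\delta(R), 1 - \delta(R))$, $\mathcal{I}^c(R) = [0, 1] \setminus \mathcal{I}(R)$, $W_i = W(\delta_i, R_i)$, $i = 1, 2$, with $W(\delta, R)$ being
defined as

$$W(\delta, R) = R + h(\delta) - \ln 2$$

and

$$\eta = \eta(\delta_1, \delta_2) = \begin{cases} 1 & \lambda W_1 + (1 - \lambda)W_2 < 0 \\ 1/\theta & \lambda W_1 + (1 - \lambda)W_2 \ge 0 \end{cases}$$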



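Therefore,

$$\begin{aligned} E\{Z^{1/\theta}(s\theta|X)\} \doteq{} & \sum_{\delta_1\in\mathcal{I}^c(R_1)}\sum_{\delta_2\in\mathcal{I}^c(R_2)} e^{n[R+\lambda h(\delta_1)+(1-\lambda)h(\delta_2)-\ln 2]}\cdot e^{-sn[\lambda\delta_1+(1-\lambda)\delta_2]} \\ & + \sum_{\delta_1\in\mathcal{I}^c(R_1)}\sum_{\delta_2\in\mathcal{I}(R_2)} e^{n[\lambda(R_1+h(\delta_1)-\ln 2)+(1-\lambda)(R_2+h(\delta_2)-\ln 2)/\theta]}\cdot e^{-sn[\lambda\delta_1+(1-\lambda)\delta_2]} \\ & + \sum_{\delta_1\in\mathcal{I}(R_1)}\sum_{\delta_2\in\mathcal{I}(R_2)} e^{n[\lambda(R_1+h(\delta_1)-\ln 2)+(1-\lambda)(R_2+h(\delta_2)-\ln 2)]/\theta}\cdot e^{-sn[\lambda\delta_1+(1-\lambda)\delta_2]} \\ & + \sum_{\delta_1\in\mathcal{I}(R_1)}\sum_{\delta_2\in\mathcal{I}^c(R_2)} e^{n\eta[\lambda(R_1+h(\delta_1)-\ln 2)+(1-\lambda)(R_2+h(\delta_2)-\ln 2)]}\cdot e^{-sn[\lambda\delta_1+(1-\lambda)\delta_2]} \\ ={} & A + B + C + D. \end{aligned} \tag{19}$$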

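Let us now handle each one of these four terms and take the limit $\theta \to \infty$. This results in:

$$\begin{aligned} A &= \left[\sum_{\delta_1\in\mathcal{I}^c(R_1)} e^{n_1[R_1+h(\delta_1)-\ln 2-s\delta_1]}\right]\cdot\left[\sum_{\delta_2\in\mathcal{I}^c(R_2)} e^{n_2[R_2+h(\delta_2)-\ln 2-s\delta_2]}\right] \\ &\doteq e^{-n_1u(s,R_1)}\cdot e^{-n_2u(s,R_2)} = e^{-n[\lambda u(s,R_1)+(1-\lambda)u(s,R_2)]}, \end{aligned} \tag{20}$$

$$\begin{aligned} B &\doteq \left[\sum_{\delta_1\in\mathcal{I}^c(R_1)} e^{n_1[R_1+h(\delta_1)-\ln 2-s\delta_1]}\right]\cdot\left[\sum_{\delta_2\in\mathcal{I}(R_2)} e^{-n_2s\delta_2}\right] \\ &\doteq e^{-n_1u(s,R_1)}\cdot e^{-n_2s\delta(R_2)} = e^{-n[\lambda u(s,R_1)+(1-\lambda)s\delta(R_2)]}, \end{aligned} \tag{21}$$

$$C \doteq e^{-ns[\lambda\delta(R_1)+(1-\lambda)\delta(R_2)]}, \tag{22}$$

and

$$D \doteq e^{-nf(s,R_1,R_2)} \tag{23}$$

where

$$f(s, R_1, R_2) = \min_{\delta_1\in\mathcal{I}(R_1),\ \delta_2\in\mathcal{I}^c(R_2)} \{s[\lambda\delta_1 + (1 - \lambda)\delta_2] - \mu(\delta_1, \delta_2)[R + \lambda h(\delta_1) + (1 - \lambda)h(\delta_2) - \ln 2]\}$$

and where

$$\mu(\delta_1, \delta_2) = \begin{cases} 1 & R + \lambda h(\delta_1) + (1 - \lambda)h(\delta_2) < \ln 2 \\ 0 & R + \lambda h(\delta_1) + (1 - \lambda)h(\delta_2) \ge \ln 2 \end{cases}$$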
Among the terms $A$, $B$, and $C$, the term $A$ is exponentially the dominant one. To check whether
or not $A$ dominates also $D$, we will have to investigate the function $f(s, R_1, R_2)$. This is done in
Subsection A.3 of the Appendix, where it is shown that this function is as follows: For $R_1 > R_2$:

$$f(s, R_1, R_2) = \begin{cases} u(s, R) & s \le s_{R_1} \\ \lambda s\delta(R_1) + (1 - \lambda)v(s, R_2) & s > s_{R_1} \end{cases} \tag{24}$$

and for $R_1 < R_2$:

$$f(s, R_1, R_2) = \begin{cases} s[\lambda\delta(R_1) + (1 - \lambda)\delta(R_2)] & s \le s_{R_2} \\ \lambda s\delta(R_1) + (1 - \lambda)v(s, R_2) & s > s_{R_2} \end{cases} \tag{25}$$

Finally, to obtain the overall exponential rate of the characteristic function, $\psi(s, R_1, R_2)$, we have to take
into account the contribution of $A$, as mentioned above. This gives:

$$\psi(s, R_1, R_2) = \min\{f(s, R_1, R_2), a(s, R_1, R_2)\}$$

where $a(s, R_1, R_2) = \lambda u(s, R_1) + (1 - \lambda)u(s, R_2)$. Now, in the case $R_1 > R_2$, for small $s$, the function
$f$ is linear with slope $\delta(R)$, whereas the function $a$ is linear with a slope of $\lambda\delta(R_1) + (1 - \lambda)\delta(R_2)$,
which is larger. Thus, $f$ is smaller in some interval of small $s$. However, for larger $s$, $f$ continues
to have a linear term with slope $\lambda\delta(R_1)$ whereas $a$ never exceeds the level of $\ln 2 - R$. Thus, there
must be a (unique) point of intersection $s^*$. Consequently, for $R_1 > R_2$, we have

$$\psi(s, R_1, R_2) = \begin{cases} f(s, R_1, R_2) & s \le s^* \\ a(s, R_1, R_2) & s > s^* \end{cases}$$

where $f(s, R_1, R_2)$ is as in (24). Concerning the case $R_1 < R_2$, both $f$ (of eq. (25)) and $a$ start as
linear functions of the same slope of $\lambda\delta(R_1) + (1 - \lambda)\delta(R_2)$. However, while the latter begins its
curvy part at $s = s_{R_1}$, the former continues to be linear until the point $s = s_{R_2} > s_{R_1}$. In this case,
then it is easy to see that $\psi(s, R_1, R_2)$ is dominated by $a$ across the entire range $s \ge 0$, i.e.,

$$\psi(s, R_1, R_2) = \lambda u(s, R_1) + (1 - \lambda)u(s, R_2).$$

We see then that the ensemble performance is substantially different in the two cases: For
$R_1 < R_2$, $\psi(s, R_1, R_2)$ is exactly the same as if we used two independent block codes of lengths
$n_1$ and $n_2$ at rates $R_1$ and $R_2$, respectively. In particular, the corresponding average distortion is
$\lambda\delta(R_1) + (1 - \lambda)\delta(R_2)$, which is, of course, larger than $\delta(R)$. In other words, we are gaining nothing
from the tree structure and the dependence between the two parts of the code. For $R_1 > R_2$, on
the other hand, there is at least a considerable range of small $s$ for which $\psi(s, R_1, R_2) = u(s, R)$,
namely, the ensemble performance is exactly like that of the ordinary ensemble of full block codes
of length $n$ and rate $R$, without any structure (which is also the best achievable exponential rate).
However, beyond a certain value of $s$, there is some loss in comparison to the ordinary ensemble.
The case $R_1 = R_2 = R$ can be obtained as the limiting behavior of both $R_1 < R_2$ and $R_1 > R_2$, by
taking both rates to be arbitrarily close to each other. In this case, we obtain $\psi(s, R_1, R_2) = u(s, R)$
throughout the entire range $s \ge 0$ (cf. the discussion on this in the Introduction). The conclusion
then is that if we use an hierarchical structure of the kind we consider in this paper, it is best to
assign equal rates at the two stages, but then we might as well abandon the tree structure of the code
altogether, and just encode the two parts independently, both at rate $R$ (this will moreover save
complexity at the encoder). If, however, certain considerations dictate different rates at different
segments, then it is better to encode at a larger rate in the first segment and at a smaller rate in
the second.

This derivation can be extended, in principle, to any finite number k of stages. The analysis is, 
of course, more complicated but conceptually, the ideas are the same. We will not carry out this 
extension in this paper. 
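The comparison between the two rate orderings can be explored numerically. The following sketch evaluates $f(s, R_1, R_2)$ directly from its definition as a minimum over $\delta_1 \in \mathcal{I}(R_1)$, $\delta_2 \in \mathcal{I}^c(R_2)$ on a grid, and then forms $\psi = \min\{f, a\}$. It is only a rough numerical illustration: the grid size and the function names are assumptions, and only the lower part $[0, \delta(R_2)]$ of $\mathcal{I}^c(R_2)$ is scanned, since its mirror image has the same entropies and larger distortion.

```python
import math

LN2 = math.log(2.0)


def h(d):
    """Binary entropy in nats."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log(d) - (1.0 - d) * math.log(1.0 - d)


def delta_of_R(R):
    """Smaller root of R + h(delta) = ln 2, by bisection (0 < R < ln 2)."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if R + h(mid) < LN2:
            lo = mid
        else:
            hi = mid
    return hi


def u(s, R):
    """Exponent of the ordinary ensemble."""
    d = delta_of_R(R)
    if s <= math.log((1.0 - d) / d):
        return s * d
    return LN2 - R + s - math.log(1.0 + math.exp(s))


def f_brute(s, R1, R2, lam, grid=300):
    """Grid minimum of s*delta - mu*[R + lam*h(d1) + (1-lam)*h(d2) - ln 2]."""
    R = lam * R1 + (1.0 - lam) * R2
    d1, d2 = delta_of_R(R1), delta_of_R(R2)
    xs1 = [d1 + (1.0 - 2.0 * d1) * i / grid for i in range(grid + 1)]  # I(R1)
    xs2 = [d2 * i / grid for i in range(grid + 1)]                      # lower part of I^c(R2)
    h2 = [h(x) for x in xs2]
    best = math.inf
    for x1 in xs1:
        h1 = h(x1)
        for x2, h2x in zip(xs2, h2):
            w = R + lam * h1 + (1.0 - lam) * h2x - LN2
            mu = 1.0 if w < 0.0 else 0.0
            best = min(best, s * (lam * x1 + (1.0 - lam) * x2) - mu * w)
    return best


def psi(s, R1, R2, lam):
    """psi = min{f, a} with a = lam*u(s,R1) + (1-lam)*u(s,R2)."""
    a = lam * u(s, R1) + (1.0 - lam) * u(s, R2)
    return min(f_brute(s, R1, R2, lam), a)


if __name__ == "__main__":
    lam = 0.5
    for s in (0.5, 2.0, 3.0):
        print(s, psi(s, 0.4, 0.2, lam), psi(s, 0.2, 0.4, lam), u(s, 0.3))
```

With the overall rate fixed at $R = 0.3$ nats, the ordering $R_1 > R_2$ should track $u(s, R)$ for small $s$, while $R_1 < R_2$ should coincide with $\lambda u(s, R_1) + (1 - \lambda)u(s, R_2)$ throughout.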

4.2 Channel Coding 

In complete duality to the source coding problem, one may consider a channel code (for the BSC)
with a similar hierarchical structure: Given a binary information vector of length $nR = n_1R_1 + n_2R_2$
nats, we encode it in two parts: The first segment, of length $n_1R_1$ nats, is encoded to a binary
channel input vector of length $n_1$, independently of the forthcoming $n_2R_2$ nats (thus, the channel
encoder is of reduced delay). Then, the remaining $n_2R_2$ nats are mapped to another binary channel
input vector of length $n_2$, which depends on the entire information vector of length $nR$.

The ensemble of codebooks is drawn similarly as before: first, a randomly drawn first-stage
codebook of size $e^{n_1R_1}$, and then, for each one of its codewords, another codebook of size $e^{n_2R_2}$
is drawn independently. Once again, each bit of each codeword is drawn by independent fair coin
tossing.

The decoder applies maximum likelihood (ML) decoding based on the entire channel output
vector $y$ of length $n = n_1 + n_2$, pertaining to the input $x$ of length $n$. The analogy with the GREM
is that here, the energy function is the log-likelihood, which is additive over the two stages by the
memorylessness of the channel.

In full analogy to the GREM and the source coding problem of Subsection 4.1, and as an
extension to the derivation in Subsection 3.3, here too, the partition function $Z_e(\beta|y)$ has exactly
the same two different types of behavior, depending on whether $R_1 > R_2$ or $R_1 < R_2$. Therefore,
we will not repeat this here.

Concerning the aspect of performance evaluation of this ensemble of codes, and a comparison to
the ordinary ensemble, here the natural figure of merit is Gallager's random coding error exponent,
which can be analyzed using methods similar to those that we used in Subsection 4.1. We will
not carry out a very refined analysis as we did before, but we will make a few observations in this
context, although not quite directly related to the GREM.

Referring to the notation of Subsection 3.3, let $\mathcal{C} = \{x_1, \ldots, x_M\}$ be a given channel code of
size $M = e^{nR}$ and block length $n$, and let $y$ designate the output vector of the BSC, of length $n$.
Gallager's classical upper bound [lU p. 65, eq. (2.4.8)] on the probability of error is well known to 
be given by 

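$$P_e \le \frac{1}{M}\sum_{m=1}^{M}\sum_{y} P(y|x_m)^{1/(1+\rho)}\left[\sum_{m'\ne m} P(y|x_{m'})^{1/(1+\rho)}\right]^{\rho}, \quad 0 \le \rho \le 1.$$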

Consider first the ordinary ensemble, where all $M$ codewords are chosen independently at random.
In this case, taking the expectation of both sides, the average error probability is upper bounded
by

$$\bar{P}_e \le \frac{1}{M}\sum_{m=1}^{M}\sum_{y} E\left\{P(y|X_m)^{1/(1+\rho)}\right\}\cdot E\left\{\left[\sum_{m'\ne m} P(y|X_{m'})^{1/(1+\rho)}\right]^{\rho}\right\}.$$

As is shown in [26], the second factor of the summand is actually the expectation of the $\rho$-th
moment of the partition function $Z_e(\beta|y)$ computed at the inverse temperature $\beta = 1/(1 + \rho)$.
Now, at least for the ordinary ensemble, the traditional derivation, which is based on applying
Jensen's inequality, is good enough to yield an exponentially tight bound [36] on the ensemble
performance. This amounts to inserting the expectation into the square brackets, i.e.,

$$\bar{P}_e \le \frac{1}{M}\sum_{m=1}^{M}\sum_{y} E\left\{P(y|X_m)^{1/(1+\rho)}\right\}\cdot\left[\sum_{m'\ne m} E\left\{P(y|X_{m'})^{1/(1+\rho)}\right\}\right]^{\rho}.$$

We shall not continue any further with the analysis of this expression. Instead, we shall compare
it as is, with a corresponding upper bound for the hierarchical ensemble defined above.

In the hierarchical case with $k = 2$ stages, the probability of error consists of two contributions.
The first pertains to all incorrect codewords $x = (x', x'')$ whose first segment $x'$ agrees with that
of the correct codeword, and the second one is associated with all other incorrect codewords. As
for the former type of codewords, the ML decoder actually compares the likelihood scores of the
second segment only (as those of the first segment are the same and hence cancel out), and so, these
incorrect codewords contribute a term of the order of $e^{-n_2E_r(R_2)}$ to the average error probability,
where $E_r(R)$ is Gallager's random coding error exponent function [47, p. 139, eq. (5.6.16)].
Concerning the second set of incorrect codewords, we can apply an upper bound as above, except
that the expectations have to be taken w.r.t. the hierarchical ensemble. However, it is easy to see
that the expectation $E\{P(y|X)^{1/(1+\rho)}\}$ is exactly the same as in the ordinary ensemble, and
thus, so is the upper bound for this set of codewords, which is then $e^{-nE_r(R)}$. The total average
error probability is then upper bounded by

$$\bar{P}_e \le e^{-nE_r(R)} + e^{-n_2E_r(R_2)}.$$

This gives further motivation why $R_2$ should be chosen smaller than $R_1$: If $R_2 > R_1$, the second
term definitely dominates the exponent, because both $n_2 < n$ and $R_2 > R$, and so $E_r(R_2) < E_r(R)$.
For a given $R$ and $\lambda$, can we, and if so how, assign the segmental rates $R_1$ and $R_2$ such that the
second term would not be dominant, i.e., $(1 - \lambda)E_r(R_2) \ge E_r(R)$? If $R$ is large enough this is
possible. For example, one way to do this is to select $R_1 = C$, where $C$ is the channel capacity. In
this case, we have, by the convexity of $E_r(\cdot)$:

$$E_r(R) = E_r(\lambda C + (1 - \lambda)R_2) \le \lambda E_r(C) + (1 - \lambda)E_r(R_2) = (1 - \lambda)E_r(R_2).$$

For this strategy to be applicable, $R$ must be at least as large as $\lambda C$.
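The choice $R_1 = C$ can be illustrated numerically for the BSC. The sketch below uses the standard form of Gallager's function for the BSC with uniform inputs, $E_0(\rho) = \rho\ln 2 - (1 + \rho)\ln[p^{1/(1+\rho)} + (1 - p)^{1/(1+\rho)}]$, and approximates $E_r(R) = \max_{0\le\rho\le 1}[E_0(\rho) - \rho R]$ by a grid over $\rho$. The crossover probability, the grid size and the function names are illustrative assumptions.

```python
import math


def h(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def capacity(p):
    """Capacity of the BSC with crossover probability p, in nats."""
    return math.log(2.0) - h(p)


def e0(rho, p):
    """Gallager's E_0(rho) for the BSC with uniform inputs, in nats."""
    t = 1.0 / (1.0 + rho)
    return rho * math.log(2.0) - (1.0 + rho) * math.log(p ** t + (1.0 - p) ** t)


def er(R, p, grid=2000):
    """Random coding exponent: grid maximum over rho in [0, 1] of E_0(rho) - rho*R."""
    return max(e0(i / grid, p) - (i / grid) * R for i in range(grid + 1))


def second_stage_rate(R, lam, p):
    """R_2 determined by R = lam*C + (1 - lam)*R_2 when R_1 = C."""
    C = capacity(p)
    if R < lam * C:
        raise ValueError("the strategy R_1 = C requires R >= lam * C")
    return (R - lam * C) / (1.0 - lam)


if __name__ == "__main__":
    p, lam, R = 0.05, 0.5, 0.3
    R2 = second_stage_rate(R, lam, p)
    # The second-stage term does not dominate if (1 - lam)*E_r(R2) >= E_r(R).
    print(R2, (1.0 - lam) * er(R2, p), er(R, p))
```

Because the grid version of $E_r$ is a maximum of affine functions of $R$, it is itself convex, so the inequality in the text carries over to the approximation.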

How does this discussion extend to a general number of stages $k$, and is there a more systematic approach to allocate the segmental rates $R_1,\ldots,R_k$ for a given overall rate $R$? For simplicity, let us suppose that the segment lengths are all the same, i.e., $n_1 = n_2 = \ldots = n_k = n/k$. The extension turns out to be quite straightforward: In the case of $k$ stages there are $k$ types of incorrect codewords: Those that agree with the correct codeword in all stages except the last stage, those that agree in all stages except the last two stages, etc. Accordingly, using the same considerations as above, it is easy to see then that the upper bound on the average error probability consists of $k$ contributions whose exponents are

$$\frac{k-i}{k}E_r\left(\frac{1}{k-i}\sum_{j=i+1}^{k}R_j\right), \quad i = 0,1,\ldots,k-1.$$

For convenience, let us denote

$$\bar{R}_i = \frac{1}{k-i}\sum_{j=i+1}^{k}R_j.$$

Under what conditions and how can we assign the segmental rates such that

$$\frac{k-i}{k}E_r(\bar{R}_i) \ge E_r(R)$$

for all $i = 1,2,\ldots,k-1$? First, we must select $\bar{R}_1$ sufficiently small such that $E_r(\bar{R}_1) \ge \frac{k}{k-1}E_r(R)$. As $R$ is given, this will dictate the choice of $R_1$ according to the identity

$$R = \bar{R}_0 = \frac{k-1}{k}\bar{R}_1 + \frac{1}{k}R_1.$$

Next, we choose $\bar{R}_2$ small enough such that

$$E_r(\bar{R}_2) \ge \frac{k}{k-2}E_r(R).$$

As $\bar{R}_1$ has already been chosen, this will dictate the choice of $R_2$ according to the identity

$$\bar{R}_1 = \frac{k-2}{k-1}\bar{R}_2 + \frac{1}{k-1}R_2,$$

and so on. This procedure continues until, in the last step, we choose $R_k = \bar{R}_{k-1}$ such that $E_r(R_k) \ge kE_r(R)$, which dictates the choice of $R_{k-1}$ via $\bar{R}_{k-2} = (R_k + R_{k-1})/2$, where $\bar{R}_{k-2}$ was selected in the preceding step. An obvious condition for this procedure to be applicable is that $R$ be large enough that $E_r(R) \le E_r(0)/k$. Note that if some of the segmental rates exceed capacity (or even the log alphabet size), this is not a problem, as long as the averages $\bar{R}_i$ are all small enough.
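The recursion that turns the chosen averages into segmental rates is plain arithmetic and can be sketched as follows. This is only an illustration for equal segment lengths: the function names and the target averages are hypothetical, and the sketch takes the averages as given rather than deriving them from the conditions on $E_r$.

```python
def segmental_rates(avg):
    """avg[i] is the chosen average rate of segments i+1..k (avg[0] is the overall rate R).

    Returns the segmental rates [R_1, ..., R_k] for k equal-length segments.
    """
    k = len(avg)
    rates = []
    for i in range(1, k):
        # avg[i-1] = ((k - i) * avg[i] + R_i) / (k - i + 1), solved for R_i
        rates.append((k - i + 1) * avg[i - 1] - (k - i) * avg[i])
    rates.append(avg[k - 1])  # last step: R_k equals the last average
    return rates

def tail_averages(rates):
    """Recompute the averages of segments i+1..k from the segmental rates."""
    k = len(rates)
    return [sum(rates[i:]) / (k - i) for i in range(k)]

avg = [0.50, 0.40, 0.25, 0.10]     # hypothetical targets, decreasing with i
rates = segmental_rates(avg)
print(rates)                        # segmental rates R_1..R_4
print(tail_averages(rates))         # reproduces avg up to rounding
```

Note that the first segmental rates come out larger than the overall rate, in line with the remark that individual rates may be large as long as the averages are small enough.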

Appendix 

A.1 Proof of Eq. (13)

We begin with a simple large deviations bound regarding the distance enumerator, which appears also in [44], but we present it here too for the sake of completeness. For $a, b \in [0,1]$, consider the binary divergence

$$\begin{aligned}
D(a\|b) &= a\ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b} \\
&= a\ln\frac{a}{b} + (1-a)\ln\left[1 + \frac{b-a}{1-b}\right]. \quad (A.1)
\end{aligned}$$

To derive a lower bound to $D(a\|b)$, let us use the inequality

$$\ln(1+x) = -\ln\frac{1}{1+x} = -\ln\left(1 - \frac{x}{1+x}\right) \ge \frac{x}{1+x}, \quad (A.2)$$

and then

$$\begin{aligned}
D(a\|b) &\ge a\ln\frac{a}{b} + (1-a)\cdot\frac{(b-a)/(1-b)}{1 + (b-a)/(1-b)} \\
&= a\ln\frac{a}{b} + b - a \\
&> a\left[\ln\frac{a}{b} - 1\right]. \quad (A.3)
\end{aligned}$$

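For every given $y$, $N(d)$ is the sum of the $e^{nR}-1$ independent binary random variables $\{1\{d(X_{m'},y) = d\}\}_{m'\ne m}$, where the probability that $d(X_{m'},y) = n\delta$ is exponentially $b = e^{-n[\ln 2 - h(\delta)]}$. The event $N(n\delta) \ge e^{nA}$, for $A \in [0,R)$, means that the relative frequency of the event $1\{d(X_{m'},y) = n\delta\}$ is at least $a = e^{-n(R-A)}$. Thus, by the Chernoff bound:

$$\begin{aligned}
\Pr\{N(n\delta) \ge e^{nA}\} &\le \exp\{-(e^{nR}-1)D(e^{-n(R-A)}\|e^{-n[\ln 2-h(\delta)]})\} \\
&\le \exp\{-e^{nR}\cdot e^{-n(R-A)}(n[\ln 2 - R - h(\delta) + A] - 1)\} \\
&\le \exp\{-e^{nA}(n[\ln 2 - R - h(\delta) + A] - 1)\}. \quad (A.4)
\end{aligned}$$

Denoting by $\mathcal{I}(R)$ the interval $(\delta(R), 1-\delta(R))$ and by $\mathcal{I}^c(R)$ the complementary range $[0,1]\setminus\mathcal{I}(R)$, we have, for $\delta \in \mathcal{I}^c(R)$:

$$\begin{aligned}
E\{N^s(n\delta)\} &\le e^{n\epsilon s}\cdot\Pr\{1 \le N(n\delta) \le e^{n\epsilon}\} + e^{nRs}\cdot\Pr\{N(n\delta) \ge e^{n\epsilon}\} \\
&\le e^{n\epsilon s}\cdot\Pr\{N(n\delta) \ge 1\} + e^{nRs}\cdot\Pr\{N(n\delta) \ge e^{n\epsilon}\} \\
&\le e^{n\epsilon s}\cdot E\{N(n\delta)\} + e^{nRs}\cdot e^{-(n\epsilon-1)e^{n\epsilon}} \\
&\le e^{n\epsilon s}\cdot e^{n[R+h(\delta)-\ln 2]} + e^{nRs}\cdot e^{-(n\epsilon-1)e^{n\epsilon}}. \quad (A.5)
\end{aligned}$$

One can let $\epsilon$ vanish with $n$ sufficiently slowly that the second term is still superexponentially small. Thus, for $\delta \in \mathcal{I}^c(R)$, $E\{N^s(n\delta)\}$ is exponentially bounded by $e^{n[R+h(\delta)-\ln 2]}$ independently of $s$. For $\delta \in \mathcal{I}(R)$, we have:

$$\begin{aligned}
E\{N^s(n\delta)\} &\le e^{ns[R+h(\delta)-\ln 2+\epsilon]}\cdot\Pr\{N(n\delta) \le e^{n[R+h(\delta)-\ln 2+\epsilon]}\} + e^{nRs}\cdot\Pr\{N(n\delta) \ge e^{n[R+h(\delta)-\ln 2+\epsilon]}\} \\
&\le e^{ns[R+h(\delta)-\ln 2+\epsilon]} + e^{nRs}\cdot e^{-(n\epsilon-1)e^{n\epsilon}},
\end{aligned}$$

where again, the second term is exponentially negligible.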

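To see that both bounds are exponentially tight, consider the following lower bounds. For $\delta \in \mathcal{I}^c(R)$,

$$\begin{aligned}
E\{N^s(n\delta)\} &\ge 1^s\cdot\Pr\{N(n\delta) = 1\} \\
&= e^{nR}\cdot\Pr\{d_H(X,y) = n\delta\}\cdot[1 - \Pr\{d_H(X,y) = n\delta\}]^{e^{nR}-1} \\
&\doteq e^{nR}\cdot e^{-n[\ln 2-h(\delta)]}\cdot\left[1 - e^{-n[\ln 2-h(\delta)]}\right]^{e^{nR}}.
\end{aligned}$$

Using again the inequality in (A.2), the second factor is lower bounded by

$$\exp\left\{-\frac{e^{nR}e^{-n[\ln 2-h(\delta)]}}{1 - e^{-n[\ln 2-h(\delta)]}}\right\} = \exp\left\{-\frac{e^{-n[\ln 2-R-h(\delta)]}}{1 - e^{-n[\ln 2-h(\delta)]}}\right\},$$

which clearly tends to unity as $\ln 2 - R - h(\delta) > 0$ for $\delta \in \mathcal{I}^c(R)$. Thus, $E\{N^s(n\delta)\}$ is exponentially lower bounded by $e^{n[R+h(\delta)-\ln 2]}$. For $\delta \in \mathcal{I}(R)$, and an arbitrarily small $\epsilon > 0$, we have:

$$\begin{aligned}
E\{N^s(n\delta)\} &\ge e^{ns[R+h(\delta)-\ln 2-\epsilon]}\cdot\Pr\{N(n\delta) \ge e^{n[R+h(\delta)-\ln 2-\epsilon]}\} \\
&= e^{ns[R+h(\delta)-\ln 2-\epsilon]}\cdot\left(1 - \Pr\{N(n\delta) < e^{n[R+h(\delta)-\ln 2-\epsilon]}\}\right),
\end{aligned}$$

where $\Pr\{N(n\delta) < e^{n[R+h(\delta)-\ln 2-\epsilon]}\}$ is again upper bounded, for an internal point in $\mathcal{I}(R)$, by a double exponentially small quantity as above. For $\delta$ near the boundary of $\mathcal{I}(R)$, namely, when $R + h(\delta) - \ln 2 \approx 0$, we can lower bound $E\{N^s(n\delta)\}$ by slightly reducing $R$ to $R' = R - \epsilon$ (where $\epsilon > 0$ is very small). This will make $\delta$ an internal point of $\mathcal{I}^c(R')$ for which the previous bound applies, and this bound is of the exponential order of $e^{n[R'+h(\delta)-\ln 2]}$. Since $R' + h(\delta) - \ln 2$ is still very close to zero, $e^{n[R'+h(\delta)-\ln 2]}$ is of the same exponential order as $e^{ns[R+h(\delta)-\ln 2]}$, since both are about $e^{n\cdot 0}$.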
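Taken together, the upper and lower bounds of this subsection say that the exponential growth rate of $E\{N^s(n\delta)\}$ is $R + h(\delta) - \ln 2$ when this quantity is negative (independently of $s$) and $s[R + h(\delta) - \ln 2]$ when it is non-negative. A minimal sketch of this two-regime rule follows; the function name `moment_exponent` and the numerical values are hypothetical, and rates are in nats.

```python
import math

def h(x):
    """Binary entropy in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def moment_exponent(R, delta, s):
    """Exponential rate of E{N^s(n*delta)} according to the two regimes above."""
    w = R + h(delta) - math.log(2)
    # w < 0: delta lies outside I(R) and the exponent does not depend on s.
    # w >= 0: delta lies inside I(R) and the exponent scales linearly with s.
    return w if w < 0 else s * w

R = 0.3                                  # hypothetical rate in nats
for delta in (0.05, 0.4):
    print(delta, [moment_exponent(R, delta, s) for s in (0.5, 1.0, 2.0)])
```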

It should be noted that a similar double-exponential bound can be obtained for the probability of the event $\{N(n\delta) \le e^{nA}\}$, where $A < R + h(\delta) - \ln 2$ and $R + h(\delta) - \ln 2 > 0$. Here we can proceed as above, except that in the lower bound on the divergence $D(a\|b)$ we should take the second line of (A.3) (rather than the third), which is of the exponential order of $b = e^{-n[\ln 2-h(\delta)]}$ (observe that here $b$ is exponentially larger than $a$, as opposed to the earlier case). Thus, we obtain $R + h(\delta) - \ln 2 > 0$ at the second level exponent, and so the decay is double exponential as before.
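The two lower bounds on the binary divergence that this subsection relies on, namely the second and third lines of (A.3), are easy to check numerically. The sketch below does so for a few hypothetical pairs $(a,b)$, including one where $b$ is much larger than $a$ and one where it is much smaller; the function names are illustrative.

```python
import math

def binary_divergence(a, b):
    """D(a||b) in nats, for a and b strictly between 0 and 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def bound_second_line(a, b):
    """Second line of (A.3): a ln(a/b) + b - a."""
    return a * math.log(a / b) + b - a

def bound_third_line(a, b):
    """Third line of (A.3): a [ln(a/b) - 1]."""
    return a * (math.log(a / b) - 1)

pairs = [(0.3, 0.1), (0.05, 0.2), (1e-3, 1e-6), (1e-6, 1e-3), (0.5, 0.5)]
for a, b in pairs:
    print(a, b, binary_divergence(a, b), bound_second_line(a, b), bound_third_line(a, b))
```

For $b$ much larger than $a$, the third line is negative and hence uninformative, while the second line stays close to $b$, which is the reason the second line is the one used in that regime.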

A.2 Proof of Eq. (18)

First, let us write $N(n_1\delta_1,n_2\delta_2)$ as follows:

$$\begin{aligned}
N(n_1\delta_1,n_2\delta_2) &= \sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}\cdot\sum_{j=1}^{M_2}1\{d_H(x'',x_{i,j}) = n_2\delta_2\} \\
&= \sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}\cdot N_i(n_2\delta_2), \quad (A.9)
\end{aligned}$$

where $x'$ and $x''$ designate $(x_1,\ldots,x_{n_1})$ and $(x_{n_1+1},\ldots,x_n)$, respectively, and where $1\{\cdot\}$ denotes the indicator function of an event. We now treat each one of the four cases pertaining to the combinations of both $\delta_1$ and $\delta_2$ being or not being members of $\mathcal{I}(R_1)$ and $\mathcal{I}(R_2)$, respectively.
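The decomposition (A.9) can be illustrated on a toy two-stage ensemble. The sketch below draws a small random hierarchical binary code (the sizes and the helper names `build_code`, `enumerator_direct`, `enumerator_nested` are hypothetical) and counts the codewords at segmental distances $(d_1,d_2)$ in two ways: by scanning all $M_1M_2$ concatenated codewords, and by the nested sum over first-segment words $i$ weighted by $N_i$.

```python
import random

def hamming(u, v):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(u, v))

def random_bits(rng, n):
    return tuple(rng.randint(0, 1) for _ in range(n))

def build_code(rng, n1, n2, M1, M2):
    """first[i] is the i-th first-segment word; second[i][j] is its j-th continuation."""
    first = [random_bits(rng, n1) for _ in range(M1)]
    second = [[random_bits(rng, n2) for _ in range(M2)] for _ in range(M1)]
    return first, second

def enumerator_direct(first, second, x1, x2, d1, d2):
    """Scan all M1*M2 concatenated codewords."""
    return sum(1 for i, f in enumerate(first) for g in second[i]
               if hamming(x1, f) == d1 and hamming(x2, g) == d2)

def enumerator_nested(first, second, x1, x2, d1, d2):
    """Nested form: indicator on the first segment times the inner enumerator N_i."""
    total = 0
    for i, f in enumerate(first):
        if hamming(x1, f) == d1:
            total += sum(1 for g in second[i] if hamming(x2, g) == d2)  # N_i
    return total

rng = random.Random(0)
first, second = build_code(rng, 6, 6, 8, 8)
x1, x2 = random_bits(rng, 6), random_bits(rng, 6)
print(enumerator_direct(first, second, x1, x2, 3, 3),
      enumerator_nested(first, second, x1, x2, 3, 3))
```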



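Case 1: $\delta_1 \in \mathcal{I}^c(R_1)$ and $\delta_2 \in \mathcal{I}^c(R_2)$

For a given, arbitrarily small $\epsilon > 0$, consider the event $\mathcal{E} = \{N(n_1\delta_1,n_2\delta_2) \ge e^{n\epsilon}\}$. If the number of indices $i$ for which $d_H(x',x_i) = n_1\delta_1$ is less than $e^{n_1\epsilon}$ and, for each $i$, $N_i(n_2\delta_2) < e^{n_2\epsilon}$, then clearly, the event $\mathcal{E}$ does not occur. Thus, for $\mathcal{E}$ to occur, at least one of these two conditions must be violated. In other words, either the number of indices $i$ for which $d_H(x',x_i) = n_1\delta_1$ is at least $e^{n_1\epsilon}$, or there exists $i$ for which $N_i(n_2\delta_2) \ge e^{n_2\epsilon}$. The probability of the former event is upper bounded by $e^{-e^{n_1\epsilon}(n_1\epsilon-1)}$ (cf. Subsection A.1). Similarly, the probability of the latter, for a given $i$, is bounded by $e^{-e^{n_2\epsilon}(n_2\epsilon-1)}$. Thus, the probability of the union of events $\bigcup_i\{N_i(n_2\delta_2) \ge e^{n_2\epsilon}\}$ is upper bounded by $M_1e^{-e^{n_2\epsilon}(n_2\epsilon-1)} = e^{n_1R_1}\cdot e^{-e^{n_2\epsilon}(n_2\epsilon-1)}$, which is still double exponential in $n$. Thus,

$$\Pr\{\mathcal{E}\} \le e^{-e^{n_1\epsilon}(n_1\epsilon-1)} + e^{n_1R_1}\cdot e^{-e^{n_2\epsilon}(n_2\epsilon-1)}.$$

Therefore,

$$\begin{aligned}
E\{N^{1/\theta}(n_1\delta_1,n_2\delta_2)\} &\le 0^{1/\theta}\cdot\Pr\{N(n_1\delta_1,n_2\delta_2) = 0\} + e^{n\epsilon/\theta}\cdot\Pr\{1 \le N(n_1\delta_1,n_2\delta_2) \le e^{n\epsilon}\} + e^{nR/\theta}\cdot\Pr\{\mathcal{E}\} \\
&\le e^{n\epsilon/\theta}\cdot\Pr\{N(n_1\delta_1,n_2\delta_2) \ge 1\} + e^{nR/\theta}\cdot\Pr\{\mathcal{E}\} \\
&\le e^{n\epsilon/\theta}\cdot E\{N(n_1\delta_1,n_2\delta_2)\} + e^{nR/\theta}\cdot\Pr\{\mathcal{E}\}, \quad (A.10)
\end{aligned}$$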

which is exponentially upper bounded by $e^{n[R+\lambda h(\delta_1)+(1-\lambda)h(\delta_2)-\ln 2]}$, since $\epsilon$ is arbitrarily small, $E\{N(n_1\delta_1,n_2\delta_2)\}$ is of the exponential order of $e^{n[R+\lambda h(\delta_1)+(1-\lambda)h(\delta_2)-\ln 2]}$, and the last term is double-exponential. To obtain the compatible lower bound, we use

$$\begin{aligned}
E\{N^{1/\theta}(n_1\delta_1,n_2\delta_2)\} &\ge 1^{1/\theta}\cdot\Pr\{N(n_1\delta_1,n_2\delta_2) = 1\} \\
&= \Pr\{N(n_1\delta_1,n_2\delta_2) = 1\}. \quad (A.11)
\end{aligned}$$

Now, the event $\{N(n_1\delta_1,n_2\delta_2) = 1\}$ is the event that there is exactly one value of $i$ such that $d_H(x',x_i) = n_1\delta_1$, and that for this $i$, there is exactly one $j$ such that $d_H(x'',x_{i,j}) = n_2\delta_2$. As shown in Subsection A.1, the probability of the former is exponentially $e^{n_1[R_1+h(\delta_1)-\ln 2]}$ and the probability of the latter is exponentially $e^{n_2[R_2+h(\delta_2)-\ln 2]}$. Thus, by independence, $\Pr\{N(n_1\delta_1,n_2\delta_2) = 1\}$ is the product, which is exponentially $e^{n[R+\lambda h(\delta_1)+(1-\lambda)h(\delta_2)-\ln 2]}$.

Cases 2 and 3: $\delta_2 \in \mathcal{I}(R_2)$

Define now the event $\mathcal{A}$ as

$$\mathcal{A} = \bigcap_{i=1}^{M_1}\left\{N_i(n_2\delta_2) \le \exp\{n_2[R_2 + h(\delta_2) - \ln 2 + \epsilon]\}\right\}.$$

As we have argued before, the probability of $\mathcal{A}$ is doubly exponentially close to unity (since the probability of $\mathcal{A}^c$ is upper bounded by the sum of exponentially many doubly-exponentially small probabilities). Now, clearly, if $\mathcal{A}$ occurs,

$$N(n_1\delta_1,n_2\delta_2) \le \exp\{n_2[R_2 + h(\delta_2) - \ln 2 + \epsilon]\}\cdot\sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}.$$

Thus,

$$E\{N^{1/\theta}(n_1\delta_1,n_2\delta_2)\} \le \Pr\{\mathcal{A}\}\cdot E\left\{\left[\exp\{n_2[R_2 + h(\delta_2) - \ln 2 + \epsilon]\}\cdot\sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}\right]^{1/\theta}\right\} + \Pr\{\mathcal{A}^c\}\cdot e^{nR/\theta}, \quad (A.12)$$

where the second term is again doubly-exponentially small. As for the first term, we bound $\Pr\{\mathcal{A}\}$ by unity and

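$$E\left\{\left[\exp\{n_2[R_2 + h(\delta_2) - \ln 2 + \epsilon]\}\sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}\right]^{1/\theta}\right\} = \exp\{n_2[R_2 + h(\delta_2) - \ln 2 + \epsilon]/\theta\}\cdot E\left\{\left[\sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}\right]^{1/\theta}\right\}, \quad (A.13)$$

where the latter expectation (cf. Subsection A.1) is of the exponential order of $e^{n_1[R_1+h(\delta_1)-\ln 2]}$ if $\delta_1 \in \mathcal{I}^c(R_1)$ (Case 2) and $e^{n_1[R_1+h(\delta_1)-\ln 2]/\theta}$ if $\delta_1 \in \mathcal{I}(R_1)$ (Case 3). Thus, in both cases, we obtain the desired exponential order as an upper bound. For the lower bound, we argue similarly that the probability of the event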

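$$\mathcal{A}' = \bigcap_{i=1}^{M_1}\left\{N_i(n_2\delta_2) \ge \exp\{n_2[R_2 + h(\delta_2) - \ln 2 - \epsilon]\}\right\}$$

is doubly-exponentially close to unity, and so,

$$E\{N^{1/\theta}(n_1\delta_1,n_2\delta_2)\} \ge \Pr\{\mathcal{A}'\}\cdot E\left\{\left[\exp\{n_2[R_2 + h(\delta_2) - \ln 2 - \epsilon]\}\sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}\right]^{1/\theta}\right\},$$

and we again use the above result on the moments of $\sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\}$ in both cases of $\delta_1$.

Case 4: $\delta_1 \in \mathcal{I}(R_1)$ and $\delta_2 \in \mathcal{I}^c(R_2)$

Since $\delta_1 \in \mathcal{I}(R_1)$, the event

$$\mathcal{A} = \left\{e^{n_1[R_1+h(\delta_1)-\ln 2-\epsilon]} \le \sum_{i=1}^{M_1}1\{d_H(x',x_i) = n_1\delta_1\} \le e^{n_1[R_1+h(\delta_1)-\ln 2+\epsilon]}\right\}$$

has a probability which is doubly-exponentially close to unity. Thus, given that $\mathcal{A}$ occurs, there are

$$e^{n_1[R_1+h(\delta_1)-\ln 2-\epsilon]} \le L \le e^{n_1[R_1+h(\delta_1)-\ln 2+\epsilon]}$$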

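indices $i_1,i_2,\ldots,i_L$ for which $d_H(x',x_i) = n_1\delta_1$. Given $L$ and given these indices, $N(n_1\delta_1,n_2\delta_2)$ is the sum of $LM_2$ Bernoulli trials, $1\{d_H(x'',x_{i,j}) = n_2\delta_2\}$, where $LM_2$ is of the exponential order of $e^{n_1[R_1+h(\delta_1)-\ln 2]+n_2R_2}$ (up to the arbitrarily small $\epsilon$), and whose probability of success is exponentially $q = e^{-n_2[\ln 2-h(\delta_2)]}$. Thus, similarly as in the derivation in Subsection A.1,

$$E\{N^{1/\theta}(n_1\delta_1,n_2\delta_2)|\mathcal{A}\} = \begin{cases} LM_2q & LM_2q < 1 \\ (LM_2q)^{1/\theta} & LM_2q \ge 1 \end{cases}$$

or, equivalently, in the notation of eq. (18):

$$E\{N^{1/\theta}(n_1\delta_1,n_2\delta_2)|\mathcal{A}\} = \begin{cases} \exp\{n[\lambda W_1 + (1-\lambda)W_2]\} & \lambda W_1 + (1-\lambda)W_2 < 0 \\ \exp\{n[\lambda W_1 + (1-\lambda)W_2]/\theta\} & \lambda W_1 + (1-\lambda)W_2 \ge 0 \end{cases}$$

The total expectation should, of course, account for $\mathcal{A}^c$ as well, but since the probability of this event is doubly exponentially small, the contribution of this term is negligible.

This completes the proof of eq. (18).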
A.3 The function $f(s,R_1,R_2)$

First, we observe that the constraints $\delta_1 \in \mathcal{I}(R_1)$ and $\delta_2 \in \mathcal{I}^c(R_2)$ can be replaced by their one-sided versions $\delta_1 \ge \delta(R_1)$ and $\delta_2 \le \delta(R_2)$, respectively, since values of $\delta_1$ and $\delta_2$ beyond 0.5 cannot be better than their corresponding reflections $1-\delta_1$ and $1-\delta_2$.

Next observe that $f(s,R_1,R_2)$ can be rewritten as follows:

$$f(s,R_1,R_2) = \min\{f_1(s,R_1,R_2), f_2(s,R_1,R_2)\},$$

where

$$f_1(s,R_1,R_2) = s\cdot\min[\lambda\delta_1 + (1-\lambda)\delta_2]$$

subject to the constraints $\delta_1 \ge \delta(R_1)$, $\delta_2 \le \delta(R_2)$, and $R + \lambda h(\delta_1) + (1-\lambda)h(\delta_2) \ge \ln 2$, and

$$f_2(s,R_1,R_2) = \min\{\lambda[s\delta_1 - R_1 - h(\delta_1) + \ln 2] + (1-\lambda)[s\delta_2 - R_2 - h(\delta_2) + \ln 2]\}$$

subject to the constraints $\delta_1 \ge \delta(R_1)$, $\delta_2 \le \delta(R_2)$, and $R + \lambda h(\delta_1) + (1-\lambda)h(\delta_2) \le \ln 2$. Note that the optimization problem associated with $f_1(s,R_1,R_2)$ is a convex problem, but the one pertaining to $f_2(s,R_1,R_2)$ is not, because of its last constraint, which is not convex.

At this point, we have to distinguish between two cases: (i) $R_1 > R_2$ and (ii) $R_1 < R_2$ (the case $R_1 = R_2$ will be taken as a limit $R_1 \to R_2$ of case (i)).

The Case $R_1 > R_2$

When $R_1 > R_2$, we have $\delta(R_1) < \delta(R) < \delta(R_2)$. As for $f_1$, it is easy to see that $\delta_1 = \delta_2 = \delta(R)$ is a solution that satisfies the necessary and sufficient Kuhn-Tucker conditions for optimality of a convex problem, and so, $f_1(s,R_1,R_2) = s\delta(R)$.

Consider next the function $f_2(s,R_1,R_2)$. Let us ignore, for a moment, the non-convex constraint $R + \lambda h(\delta_1) + (1-\lambda)h(\delta_2) \le \ln 2$, and refer only to the constraints $\delta_1 \ge \delta(R_1)$ and $\delta_2 \le \delta(R_2)$. Denote by $\tilde{f}_2(s,R_1,R_2)$ the corresponding minimum without the non-convex constraint. The minimization problem associated with $\tilde{f}_2$ is now convex, and it is easy to see that $\delta_1^* = \max\{\delta(R_1),\nu_s\}$ and $\delta_2^* = \min\{\delta(R_2),\nu_s\}$ satisfy the necessary and sufficient conditions for optimality, where $\nu_s = 1/(1+e^s)$. This is also a solution for $f_2$ if it satisfies the non-convex constraint, namely, if

$$\lambda h(\max\{\delta(R_1),\nu_s\}) + (1-\lambda)h(\min\{\delta(R_2),\nu_s\}) + R \le \ln 2. \quad (A.14)$$

Whether or not this condition is satisfied depends on $s$. Since we are assuming $R_1 > R_2$, we then have $s_{R_1} > s_{R_2}$, where we recall that $s_R = \ln\frac{1-\delta(R)}{\delta(R)}$. Consequently, there are three different ranges of $s$: $s > s_{R_1}$, $s_{R_2} < s \le s_{R_1}$, and $s \le s_{R_2}$.

When $s > s_{R_1} > s_{R_2}$, this is equivalent to $\nu_s < \delta(R_1) < \delta(R_2)$, in which case the above necessary condition (A.14) becomes

$$\lambda h(\delta(R_1)) + (1-\lambda)h(\nu_s) \le \ln 2 - R.$$

To check whether this condition is satisfied, observe that $h(\delta(R_1)) = \ln 2 - R_1$, and so this is equivalent to the condition $h(\nu_s) \le \ln 2 - R_2$, which is $\nu_s \le \delta(R_2)$, in agreement with the assumption on the range of $s$. Therefore, the above solution is acceptable for $f_2$, and by substituting it back into the objective function, we get:

$$\begin{aligned}
f_2(s,R_1,R_2) &= \lambda[s\delta(R_1) - R_1 - h(\delta(R_1)) + \ln 2] + (1-\lambda)[s\nu_s - R_2 - h(\nu_s) + \ln 2] \\
&= \lambda s\delta(R_1) + (1-\lambda)v(s,R_2). \quad (A.15)
\end{aligned}$$

When $s_{R_1} \ge s > s_{R_2}$, this is equivalent to $\delta(R_1) \le \nu_s < \delta(R_2)$, in which case the condition (A.14) becomes $h(\nu_s) \le \ln 2 - R$, or equivalently, $\nu_s \le \delta(R)$, which is $s \ge s_R$. However, $s_R$ is between $s_{R_2}$ and $s_{R_1}$, and so, the conclusion is that the non-convex constraint is satisfied only in the upper part of the interval $[s_{R_2},s_{R_1}]$, i.e., $[s_R,s_{R_1}]$. In this range, $\delta_1^* = \delta_2^* = \nu_s$, and this yields $f_2(s,R_1,R_2) = v(s,R)$. For $s < s_R$, the condition (A.14) no longer holds. In this case, the optimum solution should be sought on the boundary of the non-convex constraint, namely, under the equality constraint $R + \lambda h(\delta_1) + (1-\lambda)h(\delta_2) = \ln 2$, but this coincides then with the solution to $f_1$, which was found on this boundary as well. Thus, for $s \in [0,s_R]$, we have $f_2(s,R_1,R_2) = s\delta(R)$. Summarizing our results for $f_2$ over the entire range of $s \ge 0$, we have

$$f_2(s,R_1,R_2) = \begin{cases} s\delta(R) & 0 \le s \le s_R \\ v(s,R) & s_R < s \le s_{R_1} \\ \lambda s\delta(R_1) + (1-\lambda)v(s,R_2) & s > s_{R_1} \end{cases}$$

or, equivalently,

$$f_2(s,R_1,R_2) = \begin{cases} u(s,R) & 0 \le s \le s_{R_1} \\ \lambda s\delta(R_1) + (1-\lambda)v(s,R_2) & s > s_{R_1} \end{cases}$$



Finally, $f$ should be taken as the minimum between $f_1$ and $f_2$. Now, $f_1$ is linear and $f_2$ is concave (as it is the minimum of functions that are linear in $s$), coinciding with $f_1$ along $[0,s_R]$. Thus $f_2$ cannot exceed $f_1$ for any $s$, and so, $f = f_2$. Thus,

$$f(s,R_1,R_2) = \begin{cases} u(s,R) & 0 \le s \le s_{R_1} \\ \lambda s\delta(R_1) + (1-\lambda)v(s,R_2) & s > s_{R_1} \end{cases}$$
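The piecewise expression for the case $R_1 > R_2$ can be evaluated numerically as a sanity check. The sketch below is only illustrative: it takes $v(s,R) = s\nu_s - R - h(\nu_s) + \ln 2$, as read off from (A.15), takes $u(s,R)$ to be $s\delta(R)$ for $s \le s_R$ and $v(s,R)$ otherwise, as implied by the two equivalent summaries of $f_2$, and computes $\delta(R)$ by bisection. The function names and the numerical values (rates in nats) are hypothetical.

```python
import math

LN2 = math.log(2)

def h(x):
    """Binary entropy in nats."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def delta_gv(R):
    """Solve h(delta) = ln 2 - R for delta in [0, 1/2] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) < LN2 - R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def s_of(R):
    """s_R = ln[(1 - delta(R)) / delta(R)]."""
    d = delta_gv(R)
    return math.log((1 - d) / d)

def nu(s):
    return 1.0 / (1.0 + math.exp(s))

def v(s, R):
    """v(s, R) as read off from (A.15)."""
    return s * nu(s) - R - h(nu(s)) + LN2

def f_case_r1_gt_r2(s, R1, R2, lam):
    """Piecewise f(s, R1, R2) for R1 > R2 and s >= 0."""
    R = lam * R1 + (1 - lam) * R2
    if s <= s_of(R):
        return s * delta_gv(R)          # linear part, equal to f_1
    if s <= s_of(R1):
        return v(s, R)                   # both distances equal nu_s
    return lam * s * delta_gv(R1) + (1 - lam) * v(s, R2)

R1, R2, lam = 0.5, 0.2, 0.5              # hypothetical rates and fraction
for s in (1.0, 2.5, 5.0):
    print(s, f_case_r1_gt_r2(s, R1, R2, lam))
```

The three branches meet continuously at $s_R$ and $s_{R_1}$, and the resulting curve never exceeds the line $s\delta(R)$, in agreement with the concavity argument above.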



The Case $R_1 < R_2$

In this case, $\delta(R_1) > \delta(R_2)$. Once again, $f_1$ is associated with a convex program whose conditions for optimality are easily seen to be satisfied by the solution $\delta_1^* = \delta(R_1)$ and $\delta_2^* = \delta(R_2)$. Thus,

$$f_1(s,R_1,R_2) = s[\lambda\delta(R_1) + (1-\lambda)\delta(R_2)].$$

As for $f_2$, let us examine again the various ranges of $s$, where this time, $s_{R_1} < s_R < s_{R_2}$. For $s \ge s_{R_2}$, we have $\nu_s \le \delta(R_2) < \delta(R_1)$, and then the condition (A.14) is equivalent to $h(\nu_s) \le \ln 2 - R_2$, which is $\nu_s \le \delta(R_2)$, in agreement with the assumption. This corresponds to $\delta_1^* = \delta(R_1)$ and $\delta_2^* = \nu_s$, which yields

$$f_2(s,R_1,R_2) = \lambda s\delta(R_1) + (1-\lambda)v(s,R_2).$$

For $s_{R_1} \le s < s_{R_2}$, which means $\delta(R_2) < \nu_s \le \delta(R_1)$, condition (A.14) is satisfied with equality, and the corresponding solution is $\delta_1^* = \delta(R_1)$ and $\delta_2^* = \delta(R_2)$, which yields

$$f_2(s,R_1,R_2) = s[\lambda\delta(R_1) + (1-\lambda)\delta(R_2)].$$

For $s < s_{R_1}$, eq. (A.14) is not satisfied, and we resort again to the boundary solution, which, as mentioned earlier, is the same as $f_1$. Summarizing our findings for the case $R_1 < R_2$, and applying similar concavity considerations as before (telling us that $f = f_2$), we have:



f{s,Ri,R2) 
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